Figure 1.3-1: 8-Server Chassis Example



1.4.2 MAC Features 



• PCI Express V1.0a compliant
• Can be configured to operate as a Shared Port or a Non-Shared Port.
• Layered architecture with Configuration Block, Presentation Layer, Transaction Layer, Data Link Layer, and Logical Physical Layer.
• Supports x1, x2, x4, and x8 links
• Supports up to 16 OS Domains and up to 8 Virtual Channels with a maximum of 16 different OSD/VC combinations.
• Provides a 64-bit data path at 250MHz
• Supports configurable maximum packet sizes in the range of 128B to 1KB
• Performs PHY level link negotiations
• The MAC is divided into 8 sub-modules, where each sub-module consists of 8 physical lanes that can be negotiated to operate as 1 x8 link, or 2 links of x1, x2, or x4 lane widths
• 64KB of Receive Data Buffer and 6144 Bytes of Header Buffer (256 outstanding transactions per port)
• Supports cannibalization of Receive Data Buffers and Receive Header Buffers. An x8 link can use all 128KB of Receive Data Buffer and all 512 Header Credits.
• 8-bit Single Error Correction and Double Error Detection (SECDED) on Rx Data & Header Buffers
• Maintains Transaction Ordering within each egress port per OSD/
• Add Power States supported
• Add latency statements (mention 'fast path' feature)



1.4.3 Switch Core Features




The switch core is responsible for forwarding transaction packet data from the 16 RX MAC interfaces over to the 16 TX MAC interfaces. It will be implemented as a data crossbar and a transaction scheduler. The high level functions of the switch core are:

• Implements a transaction scheduler that allows programmable fairness at the different arbitration levels (per input OSD, per output port, per VC) and that transfers one transaction per clock from a source to a destination

• Provides virtual output queues so that each input feeds its own virtual switch from a software configuration viewpoint

• Implements a data crossbar that efficiently moves the packet data from the 16 RX MACs over to the 16 TX MACs

• Provides an interface to the chip management logic for transactions that terminate in our device

• Lookup interface for P2P bridge routing tables (support for address routing, ID-based bus routing, and implicit routing)



2 Micro Architecture

2.1 Detailed Block Diagram

2.1.1 Chip Level Block Diagram

2.1.2 Block Descriptions 

2.2 System View of Chip 

2.2.1 PCI-Bus enumeration - Prashantha 

2.2.2 Transaction Routing 

According to the PCI Express Base Spec (1a), there are three ways to route a packet: address, ID, or implicit. The type of routing used is dependent on the type field of the header (and the routing sub-field, r[2:0], of the Type field for message transactions). Each routing type will be covered in the next few sections.

In PCI Express, each switch is logically a set of virtual P2P bridges connected as shown below.

Figure 2.2-1: (diagram labels: Endpoint, Endpoint)

This topology is a simplified version of our switch since it only shows 3 ports instead of 16, but it illustrates the idea by having 4 virtual P2P bridges, and 5 PCI buses, numbered 2 through 6. This topology will be used to show examples of how our routing will work and also shows where our on-chip device, the Common On-chip Processor (COP), will be located in the topology once discovery is completed by the root complex.



2.2.3 P2P bridges 

A P2P bridge forwards memory and I/O transfers from one PCI bus over to the other side of the bridge where another PCI bus resides. Each P2P bridge has a 256B header register space that is accessible by the system and is used to set up all the parameters for the bridge. The following diagram shows the bridge header (also called a Type 1 header) with the fields highlighted that apply to transaction routing.

Figure 2.2-2: (diagram labels: Interrupt Pin, Interrupt Line)

All of these registers are documented in the P2P Bridge spec (pp. 26-54), with some changes made in PCI Express (base spec 1a, section 7.5.3, pp. 327-330). The addressing starts at the top right of the table as register address 0x0 (lower 8 bits of Vendor ID) and continues to the bottom left as register 0x3F (upper 8 bits of Bridge Control).



2.2.4 Address routing 

Address routing is used for memory (32-bit and 64-bit) and I/O transactions that must 
pass through our switch. Each will be discussed in detail in the next few sections. 

The header fields important to address routing are shown in the next two diagrams. 

Figure 2.2-3: Header format for 64-bit Transactions (prefetchable memory)

Figure 2.2-4: Header format for 32-bit Transactions (memory and I/O)




2.2.4.1.1 Memory transactions

In P2P bridges, there are 2 different address ranges that can be defined, the 32-bit memory range (which is required) and the 64-bit memory range (which is optional). Our switch will implement both ranges, so the bridge header registers for both memory transaction types will need to support the system. The memory TLP in PCI Express does not specify which of the two ranges a transaction will fit into (the SAC and DAC address cycles don't exist as they do in PCI), so both the memory range and the prefetchable memory range must be compared against for a 32-bit transaction since the prefetchable range can be below 4 GB. A memory transaction is defined as follows:

TLP Type   Fmt[1:0]   Type[4:0]   Description
MRd        00, 01     0 0000      Memory read
MRdLk      00, 01     0 0001      Memory read-locked
MWr        10, 11     0 0000      Memory write

Table 2.2-1:

The LSB in the Fmt[1:0] field of each of the three memory request types is high if a 64-bit address is present and low for a 32-bit address.



Each memory request (for decode purposes of the bridge) is addressing 1 MB of memory, 
so the lower 20 bits are assumed to be zero to match the memory space to the base/limit 
range. Note that any memory transaction that is addressing less than 4 GB is always a 
32-bit transaction. Only transactions above 4 GB may use 64-bit address mode TLP and 



NextIO, Inc. Confidential 
Property of NextIO, Inc. 



Page 5 of 222 



NextIO, Inc. 

© All Rights Reserved. 



NEXSIS Overview Document 
V0.8 



are defined as prefetchable transactions that use the prefetchable base/limit registers. 
Following is a diagram that shows one way the system could provision the two memory 
address ranges. 



Figure 2.2-5: (diagram labels: Primary bus, Secondary bus, 4GB Boundary)

In this figure, the address 0x40_0000 going from north to south would still get through. Also, the address [illegible]0_0000_0000 would go through from north to south. The address 0xFB[illegible]_FFFF_FFFF from south to north would also pass across the bridge in this configuration.

Bottom line is that for 32-bit memory transactions, the standard memory range, the prefetchable range, and the memory-mapped BAR space must all be examined to process a transaction. For 64-bit transactions, only the prefetchable range must be examined.
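
The decode rule summarized above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Window` class, the `memory_hit` function, and the example base/limit values are hypothetical names and numbers chosen for this sketch, not register names or settings from the design.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """A base/limit pair (hypothetical helper, not a design register)."""
    base: int   # first address of the window
    limit: int  # last address of the window (inclusive)

    def contains(self, addr: int) -> bool:
        # Bridge memory windows are 1 MB granular, so the low 20 bits
        # are ignored when matching an address against base/limit.
        return (self.base >> 20) <= (addr >> 20) <= (self.limit >> 20)


def memory_hit(addr: int, is_64bit_tlp: bool,
               mem: Window, prefetch: Window, bar: Window) -> bool:
    """Return True if a memory TLP falls inside this bridge's ranges."""
    if is_64bit_tlp:
        # 64-bit TLPs are only compared against the prefetchable range.
        return prefetch.contains(addr)
    # 32-bit TLPs must be compared against all three ranges.
    return mem.contains(addr) or prefetch.contains(addr) or bar.contains(addr)


# Example values (assumptions for illustration only).
mem = Window(0x4000_0000, 0x4FFF_FFFF)
prefetch = Window(0x1_0000_0000, 0x1_FFFF_FFFF)
bar = Window(0xFEB0_0000, 0xFEBF_FFFF)
```

A window whose limit is programmed below its base never matches in this sketch, which mirrors how a base/limit range is disabled.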

2.2.4.1.2 I/O transactions 

I/O transactions are limited to a 32-bit space, with a base and a limit being specified 
within the P2P bridge header just like memory transactions. The following fields are 
used to define the I/O TLP:

TLP Type   Fmt[1:0]   Type[4:0]   Description
IORd       00         0 0010      I/O read request
IOWr       10         0 0010      I/O write request

Table 2.2-2:



This type of access is 4kB-aligned (lower 12 bits are assumed to be zero for matching the range), so only the upper 20 bits of an I/O transfer are compared for a match in the address range. For transactions traveling downstream, the I/O address must match within the I/O range, and the transaction will be forwarded downstream. For transactions traveling upstream, the address must match outside the range, and the transaction will be forwarded upstream. The following diagram shows how a sample I/O address base/limit pair can be set up, using a base of 0x20_0000 and a limit of 0x7F_FFFF.

Figure 2.2-6:

For example, using this configuration, a transaction of 0x40_0000 is presented on the bridge's primary bus interface. Since it matches within the range programmed in the I/O base/limit registers, the transaction is forwarded downstream to the bridge's secondary bus. Next, a transaction for address 0x0 is presented on the bridge's secondary bus. Since this is NOT within the base/limit range, this transaction is forwarded up to the bridge's primary bus. In order to disable this range, the limit must be programmed to a smaller number than the base. This means all downstream transactions will be rejected, and all upstream transactions will be allowed through. In the event of an error (an address of 0x10_0000 is presented at the primary bus for example), the transaction is discarded and an Unsupported Request (UR) message must be sent back out the port.
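
The forwarding decision in the example above can be sketched as follows. The `Action` enum and `io_route` function are hypothetical names introduced for illustration. The `NOT_FORWARDED` result for an upstream transaction that falls inside the range is an assumption of this sketch, since the text only says such a transaction is not forwarded upstream.

```python
from enum import Enum


class Action(Enum):
    FORWARD_DOWNSTREAM = "forward to secondary bus"
    FORWARD_UPSTREAM = "forward to primary bus"
    REJECT_UR = "discard and signal Unsupported Request"
    NOT_FORWARDED = "not forwarded upstream (assumed outcome)"


def io_route(addr: int, from_primary: bool,
             io_base: int, io_limit: int) -> Action:
    """Decide what a P2P bridge does with an I/O transaction."""
    # Programming the limit below the base disables the range.
    enabled = io_limit >= io_base
    # I/O windows are 4 kB granular: ignore the low 12 bits.
    in_range = enabled and (io_base >> 12) <= (addr >> 12) <= (io_limit >> 12)
    if from_primary:
        # Downstream traffic must hit the range to be forwarded.
        return Action.FORWARD_DOWNSTREAM if in_range else Action.REJECT_UR
    # Upstream traffic must miss the range to be forwarded.
    return Action.NOT_FORWARDED if in_range else Action.FORWARD_UPSTREAM
```

With the sample base of 0x20_0000 and limit of 0x7F_FFFF, `io_route(0x40_0000, True, 0x20_0000, 0x7F_FFFF)` forwards downstream and `io_route(0x0, False, 0x20_0000, 0x7F_FFFF)` forwards upstream, matching the walkthrough above.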



2.2.4.1.3 Peer to Peer support 

Upstream traffic in a bridge is only allowed to cross the bridge if it's outside the 
{base,limit} range as described above in the examples. This would mean that if a 
downstream port initiates a read or write request, the address would need to fall outside 
the ranges of that port's P2P bridge. 




Assuming it's outside the range, the address is then compared with the root's address range and the other endpoint's address ranges. For the root's range, the address must again be outside the {base,limit} tuple. If it is, the transaction is forwarded up to the root. If the address is within the {base,limit} range, the other endpoint's ranges are checked for downstream matches. If a match is found, the transaction is forwarded to the other endpoint. If no match is found, the bridge issues a UR packet back to the endpoint.

The same mechanism will be used for downstream peer-to-peer support as well. The Root will be compared first since it is the likely recipient, but if no match is found the other downstream ports will be checked to see if the transaction should be routed there instead.

So, our architecture will be able to support IPC (inter-processor communication) and downstream peer-to-peer if the EEPROM provisions our switch such that two roots can appear on the same hierarchy. Our address_lookup_module logic does not preclude this from happening. We have a bit defined per base/limit pair that allows for a very flexible mapping. This bit will define each port as either upstream or downstream for each OSD. This allows the address routing logic to map addresses into the ranges and allow for upstream peer-to-peer and downstream peer-to-peer in the same architecture. This is configurable by the EEPROM as to which ports "appear" on each OSD during device discovery. Note that only the trusted software entity (either a driver running on an OSD or the I2C interface software) is allowed to modify the routing bits, so OSDs will not be able to make other OSDs "appear" on their PCI hierarchy by accident. Any peer to peer configuration is application specific on how it handles the address mapping.

2.2.5 ID routing 

ID routing is used for configuration request TLPs, completion TLPs, and optionally 
vendor-defined messages. The TLP header shown is only 12 bytes, but some ID routed 
packets can have a 16 byte header depending on the TLP type. 



Figure 2.2-7: Header format for ID-based Transactions



For this type of transaction, the bus_number[7:0] field is used to determine which port a TLP needs to be sent to. If the bus_number field indicates one of our switch's VP2P bridge headers is being addressed, the COP will use the device number field to determine which header is being addressed and take the appropriate action.

The header fields are shown in the table below for these transaction types.

TLP Type   Fmt[1:0]   Type[4:0]   Description
CfgRd0     00         0 0100      Configuration Read Type 0
CfgWr0     10         0 0100      Configuration Write Type 0
CfgRd1     00         0 0101      Configuration Read Type 1
CfgWr1     10         0 0101      Configuration Write Type 1
Msg        01         1 0010      Message Request - Routed by ID
MsgD       11         1 0010      Message Request with data payload - routed by ID
Cpl        00         0 1010      Completion without Data - Used for I/O and
                                  Configuration Write Completions and Read
                                  Completions (I/O, Configuration, or Memory)
                                  with Completion Status other than
                                  Successful Completion
CplD       10         0 1010      Completion with Data - Used for Memory,
                                  I/O, and Configuration Read Completions.
CplLk      00         0 1011      Completion for Locked Memory Read without
                                  Data - Used only in error case.
CplDLk     10         0 1011      Completion for Locked Memory Read -
                                  otherwise like CplD.

Table 2.2-3:



Note that configuration requests will always be sent to the COP for processing. The function_number fields shouldn't need to be used by the switch since we won't support any multi-function devices.



For messages and completions, the bus number is compared against the programmed secondary and subordinate bus numbers for the port's P2P bridge header. An example is shown below from a software topology point of view.

Figure 2.2-8: (diagram labels: Endpoint)

Example 1

A configuration packet arrives on port 1.

• If the packet is a type 0 configuration packet, the Address_lookup_module immediately returns port 16 as the destination since this packet will be handled by the COP.

• If the packet is a type 1 configuration packet, the Bus Number field of the header is compared with Port 1's P2P bridge secondary and subordinate bus numbers (secondary bus 0x3, subordinate bus 0x7). If (bus number < 3) or (bus number > 7), the transaction is dropped and a UR transaction is signaled back to the requester since it isn't in the range of the {secondary,subordinate} tuple.

• Else if the Bus Number field is equal to 3 (the secondary bus number of the P2P bridge header), the packet is addressing one of the downstream virtual P2P bridges inside the switch. Again, the Address_lookup_module returns port 16, and the packet is sent to the COP.

• Else, the {secondary,subordinate} tuples of all 16 destination ports are compared to see which port this packet should be routed to (i.e., secondary_bus[port 3] < packet_bus_number <= subordinate_bus[port 3], etc.). Once a match is found, the destination port is returned. Note this could still return port 16, which would mean the BIOS is addressing the COP's type 0 header space.
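
The decision sequence of Example 1 can be sketched as a single lookup function. Everything named here (`route_config`, `COP_PORT`, the `downstream` table and its bus numbers) is hypothetical and chosen for illustration. The sketch also assumes the conventional inclusive bridge comparison (secondary <= bus <= subordinate) for the downstream ports, which is slightly wider than the comparison written in the last bullet.

```python
from typing import Dict, Optional, Tuple

COP_PORT = 16  # port 16 is the COP; ports 0 through 15 are data ports

BusRange = Tuple[int, int]  # (secondary bus, subordinate bus)


def route_config(cfg_type: int, bus: int, ingress: BusRange,
                 downstream: Dict[int, BusRange]) -> Optional[int]:
    """Return the destination port, or None when a UR must be signaled."""
    if cfg_type == 0:
        # Type 0 configuration packets are always handled by the COP.
        return COP_PORT
    secondary, subordinate = ingress
    if bus < secondary or bus > subordinate:
        return None  # outside the {secondary,subordinate} tuple: UR
    if bus == secondary:
        # Addresses one of the virtual P2P bridges inside the switch.
        return COP_PORT
    for port, (sec, sub) in downstream.items():
        if sec <= bus <= sub:  # inclusive compare (assumption, see lead-in)
            return port
    return None


# Hypothetical bus assignment behind an ingress bridge covering buses 3..7.
downstream = {2: (4, 4), 3: (5, 6), COP_PORT: (7, 7)}
```

For instance, `route_config(1, 5, (3, 7), downstream)` selects port 3 in this made-up table, while a bus number of 2 falls outside the ingress range and yields `None`.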



2.2.6 Implicit routing 

Message transactions are the only type that can use implicit routing, and the port logic should always examine the r[2:0] sub-field for message packets to see how to handle them. The following table shows the decoding of the various values.

r[2:0]    Definition
000       Routed to Root Complex
001       Routed by Address
010       Routed by ID
011       Broadcast from Root Complex
100       Local - Terminate at Receiver
101       Gathered and Routed to Root Complex (see section 5.3.3.2.1 of Base Spec)
110-111   Reserved - Terminate at Receiver

Table 2.2-4:



If a message transaction's value of r[2:0] is either 001 (address routing) or 010 (ID routing), the port logic will follow the same sequence of events listed above in sections 2.2.4 and 2.2.5. If the value is one of the other values, the route is "implicit" since the definition of the encoding tells the port logic what to do.
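
As a small illustration of the decode in Table 2.2-4, the sketch below maps r[2:0] to its definition and reports whether the route is implicit. The function and table names are hypothetical and exist only for this example.

```python
from typing import Tuple

# r[2:0] encodings from Table 2.2-4; 110 and 111 are reserved.
ROUTING = {
    0b000: "Routed to Root Complex",
    0b001: "Routed by Address",
    0b010: "Routed by ID",
    0b011: "Broadcast from Root Complex",
    0b100: "Local - Terminate at Receiver",
    0b101: "Gathered and Routed to Root Complex",
}


def decode_message_routing(r: int) -> Tuple[str, bool]:
    """Return (definition, is_implicit) for a message's r[2:0] value."""
    if not 0 <= r <= 0b111:
        raise ValueError("r[2:0] is a 3-bit field")
    definition = ROUTING.get(r, "Reserved - Terminate at Receiver")
    # Only address routing (001) and ID routing (010) need a table
    # lookup; every other encoding is an implicit route.
    is_implicit = r not in (0b001, 0b010)
    return definition, is_implicit
```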

2.2.6.1.1 Routed to Root Complex

Our chip will have a root complex field assigned to each input port based on the source OSD to handle this type of packet. The Address_lookup_module will simply index the root table (by OSD number) and return the root port number to the MAC.

2.2.6.1.2 Broadcast from Root Complex

This only applies to two types of TLPs, PME_Turn_Off and Unlock. Unlock will not actually be broadcast (which is legal according to the spec) since our switch will track which port has been locked and unlock the appropriate queues.

PME_Turn_Off will be handled by the address lookup module logic and the COP. The lookup logic

2.2.6.1.3 Local - Terminate at Receiver

This type of packet will always be returned from the Address_lookup_module as routed to port 16 for the COP to handle (ports 0 through 15 are the actual data ports).

2.2.6.1.4 Gathered and Routed to Root Complex

This type of routing is only used whenever the switch receives PME_TO_Ack messages from downstream ports. The Address_lookup_module will again return port 16. The COP will scoreboard the responses from all downstream ports. Once the COP has received a PME_TO_Ack packet from each downstream port, it then returns a single PME_TO_Ack packet back to the root complex and sets the r[2:0] sub-field to 3'b101 to tell the root complex that all downstream ports in the switch responded to the PME_Turn_Off message.

Note that a timer must also be implemented in this logic to avoid deadlock, since the return of the PME_TO_Ack packet back to the Root should not be blocked due to one device's failure to send a PME_TO_Ack in a reasonable amount of time (no time given in the spec for this, but 100 ms is mentioned elsewhere as a timeout number, so maybe we'll use that).
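
The scoreboard-plus-timer behavior described above could look roughly like the following. This is a behavioral sketch only: the class name, method names, and the way time is passed in are all assumptions for illustration, and the 100 ms default simply reuses the tentative number mentioned in the text.

```python
from typing import Iterable


class PmeAckScoreboard:
    """Collects PME_TO_Ack messages from downstream ports (sketch)."""

    def __init__(self, downstream_ports: Iterable[int],
                 timeout_ms: int = 100) -> None:
        self.pending = set(downstream_ports)  # ports that have not acked
        self.timeout_ms = timeout_ms          # tentative value from the text
        self.sent = False                     # aggregated ack already sent

    def on_ack(self, port: int) -> None:
        """Record a PME_TO_Ack received from one downstream port."""
        self.pending.discard(port)

    def poll(self, elapsed_ms: int) -> bool:
        """Return True exactly once, when the single gathered
        PME_TO_Ack should be returned to the root complex."""
        if self.sent:
            return False
        # Send when every port has answered, or when the timer expires
        # so that one silent device cannot block the response.
        if not self.pending or elapsed_ms >= self.timeout_ms:
            self.sent = True
            return True
        return False
```

The timer path is what prevents the deadlock: a port that never answers stays in `pending`, but `poll` still fires once `elapsed_ms` reaches the timeout.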
2.2.6.1.5 Reserved - Terminate at Receiver

These are nearly identical to the Local - Terminate at Receiver types of routing. This 
transaction will return port 16 from the Address_lookup_module and will be sent to the 
COP. 

2.2.7 Isolation of OS Domains 

As TLPs are received on a Rx MAC, they are queued into a structure that is separated by OS domain. PCI-EX ordering rules are always adhered to within an OS domain but never across OS domains to avoid head-of-line blocking conditions. Flow control is also isolated by OS domain and VC. The PCI-EX base specification defines flow control on a per VC basis. The switch is configured to allow assignment of VC resources per OS domain to enable absolute flow control (i.e. flow control per VC per OS domain). Therefore, from the Rx MAC perspective, each OS domain is allocated its own buffer, queue, and flow control resources.

In the Switch Core, all transactions are arbitrated using a simple round-robin algorithm, where fairness is enforced at 3 different levels - port arbitration, VC arbitration, and OSD arbitration.
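
A simple round-robin arbiter of the kind mentioned above can be sketched as follows. The class is a generic illustration, not the switch core's actual scheduler; one instance per level (port, VC, OSD) would be one way to nest the three arbitration levels, but that composition is an assumption of this sketch.

```python
from typing import Optional, Sequence


class RoundRobin:
    """Grants one requester per call, rotating priority after each grant."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.last = n - 1  # start so that requester 0 has first priority

    def grant(self, requests: Sequence[bool]) -> Optional[int]:
        """Return the index of the granted requester, or None if idle."""
        for offset in range(1, self.n + 1):
            idx = (self.last + offset) % self.n
            if requests[idx]:
                # The winner becomes lowest priority for the next round.
                self.last = idx
                return idx
        return None
```

Because the search always starts just after the previous winner, a continuously requesting input cannot starve the others, which is the fairness property the text relies on.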





The COP manages the different OS domains by appearing as a Type 0 "shared" I/O device. After switch configuration is complete, each Root Port is assigned to a particular OS domain, and for each Root Port the COP knows what ports are targets on that bus and their corresponding port/OSD mapping. A Root Port has no knowledge of the presence of any other Root Port. If a Root Port uses a shared I/O port, it has no knowledge of the other OS domains that can access the Shared I/O port. Therefore, at the COP level, OS domains are completely isolated.

One exception to this rule is a scenario where the console management software is running on a blade server. In that case, the OSD is allowed to manage the device-specific registers by accessing them through the COP's type 0 header space. To become a trusted blade server, the driver on that OSD must present a key to the switch. If that key matches the trusted key, the COP sets a bit that tells that OSD it can now access our device-specific registers.

2.3 Data Flow Examples
2.3.1 Address-based Requests 

All Memory transactions (Reads, Writes, and Completions) and I/O transactions (Reads, 
Writes, and Completions) are address-based. The following describes the data flow of an 
address-based TLP: 

1. The initiator generates a Posted Request.

2. The Rx MAC receives the Posted Request from the SERDES.

3. The Rx Physical Layer Module (RxPHY) performs 8b/10b decoding, de-scrambles the data stream, and performs clock compensation between the extracted clock and the core clock. The data stream is in the form of 8 bits per clock (at 250 MHz). It will also perform lane-to-lane deskew if lane width is greater than one.

4. The Rx Data Link Layer Module (RxDLLM) converts the data from the RxPHY to a 64-bit data path at 250MHz regardless of the link width. For example, if the link width is x1, it will take 8 clks to gather 64 bits at 250 MHz. The Rx DLLM then checks the Sequence Number and LCRC as the TLP is passing through to the Transaction Layer. The Rx DLLM also removes the Sequence Number and the LCRC from the TLP before it forwards the packet to the Rx Transaction Layer Module (TLM).

5. The RxTLM performs TC to VC mapping and OS domain identification to determine if there are enough header and data resources for the particular TLP. If there are, the TLP is stored in the header buffer and the data buffer (if the TLP contains payload) and the starting address of the Header is forwarded to the Address Lookup FIFO. All flow control calculations are implemented in the Rx TLM and are scheduled to transmit every time a TLP is stored or an FC credit is de-allocated.

6. The Rx TLM stores the header address of each TLP in the Address Lookup FIFO. When the Address Lookup Interface is ready to present a transaction to the Address Lookup Module in the Switch Core, it uses the header address stored in the Address Lookup FIFO to access the routing information (in this case, the 32-bit or 64-bit address) from the Header Buffer. The following information is passed into the Lookup Module:

• address[43:0] - contains either the upper 44 bits of a 64-bit memory transaction, the upper 12 bits of a 32-bit memory transaction, or the upper 20 bits of an I/O transaction.

• lookup_type[2:0] - field is used to specify the transaction type as shown in the following table. The types relevant to this transaction type are marked with an asterisk.

Lookup type[2:0]   Transaction definition
3'b000 *           32-bit memory transaction
3'b001 *           64-bit memory transaction
3'b010             ID-based transaction
3'b011 *           32-bit I/O transaction
3'b100             Routed to root complex
3'b101             Broadcast from root complex
3'b110             Terminate at receiver
3'b111             Reserved

• port_is_downstream - lets the Address Lookup Module know what the most likely lookup sequence should be (routed to root complex).

• tc[2:0] - used to help determine the egress_qid[3:0] for this transaction.

• osd[3:0] - this is used in conjunction with the tc[2:0] field to determine the egress_qid[3:0].

Fast Path

If the Address Lookup FIFO is empty, the routing information is immediately presented to the Address Lookup Module in the Switch Core.



7. The Address Lookup Module (ALM) figures out the root port and which ports are connected to this ingress port. Then it begins walking the entries (up to 16) to find the base/limit pair that matches the address range, based on the array of valid ports to search. Note that the root entry is searched first if the 'port_is_downstream' bit is set. The ALM also determines the egress QID number (ranging from 0 to 15).

8. The ALM returns the egress_qid[3:0] and destination_port[4:0] to the Address Lookup Interface in the RxTLM. The transaction will not be submitted to the Transaction Queues until there is enough data, or until the entire packet is stored when the egress_cut_through_enbl is not set.

9. When the Address Lookup Module returns the egress_qid[3:0] and destination_port[4:0], the Address Lookup Interface stores the information in a 32 deep FIFO. This ALM Response FIFO is necessary in case the Transaction Queues are not able to accept the transactions as fast as the Address Lookup Module Interface is able to submit them. When this FIFO becomes half full, it backpressures the Address Lookup Module Interface. This means that the Address Lookup Module Interface will not issue any more requests to the Address Lookup Module until the ALM Response FIFO is less than half full.

10. The state of each transaction queue (empty or non-empty) in the Rx PM is sent to the Switch Core. The Switch Core will query the Presentation Module asking for the next transaction in a particular queue. The Presentation Module will resolve the transaction ordering and return a packet header to the Switch Core. The Switch Core will check flow control credits and will queue the transaction unless it does not pass control gating. The Switch Core will use the query information to eventually request the Presentation Module to transmit a packet.

11. The Presentation Layer will read the packet information and pop the packet out of the Transaction Queue. It will pass the packet information to the packet scheduler.

12. The Packet Scheduler will take the packet information, and create a TLP by taking the original TLP header and appending the TLP data. This packet is sent to the Data Mover.

13. Once the Switch Data Mover starts transferring the data, the TLP is stored in the Tx TLM Mini-FIFO 64 bits at a time. As the TLP is read from the Mini-FIFO, a Shared IO Header is inserted before the TLP Base Header (if the endpoint is a shared I/O). This is fed into the Tx Data Link Layer Module (DLLM).

14. The Tx DLLM receives the TLP from the Mini-FIFO and starts calculating the LCRC along with appending the Sequence Number to the start of the TLP.

15. The TLP is forwarded to the Tx Physical Layer Module (PLM) where it is scrambled and encoded from 8 bits to 10 bits and sent out on the wire.
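
A few of the numeric details in the steps above can be made concrete with a short sketch: how the address[43:0] lookup field is derived for each transaction type, how many clocks the Rx DLLM needs to gather 64 bits, and when the ALM Response FIFO asserts backpressure. The function names and the string keys are hypothetical, and the width calculation assumes each lane delivers 8 bits per 250 MHz clock, which is how the x1 example in step 4 works out.

```python
ALM_RESPONSE_FIFO_DEPTH = 32  # "32 deep FIFO" from step 9


def lookup_address(kind: str, addr: int) -> int:
    """Build the value carried on address[43:0] for an address lookup."""
    if kind == "mem64":
        return (addr >> 20) & ((1 << 44) - 1)  # upper 44 bits of 64
    if kind == "mem32":
        return (addr >> 20) & 0xFFF            # upper 12 bits of 32
    if kind == "io":
        return (addr >> 12) & 0xFFFFF          # upper 20 bits of 32
    raise ValueError(f"unknown transaction kind: {kind}")


def clocks_per_64_bits(link_width: int) -> int:
    """Clocks needed to gather 64 bits, assuming 8 bits per lane per clock."""
    if link_width not in (1, 2, 4, 8):
        raise ValueError("supported link widths are x1, x2, x4, and x8")
    return 64 // (8 * link_width)


def alm_backpressure(occupancy: int) -> bool:
    """True while the ALM Response FIFO is at least half full."""
    return occupancy >= ALM_RESPONSE_FIFO_DEPTH // 2
```

Shifting by 20 for memory and by 12 for I/O reflects the 1 MB and 4 kB decode granularity described in section 2.2.4.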

2.3.2 Configuration Requests and Completions

Configuration Requests and Completions follow the same steps in the previous section except that they are ID based transactions. Steps 6 and 7 would be replaced with:

6. The Rx TLM stores the header address of each TLP in the Address Lookup FIFO. When the Address Lookup Interface is ready to present a transaction to the Address Lookup Module in the Switch Core, it uses the header address stored in the Address Lookup FIFO to access the routing information (in this case, the bus number) from the Header Buffer. The following information is passed into the Lookup Module:

• address[43:0] - contains the 8-bit bus number.

• lookup_type[2:0] - field is used to specify the transaction type as shown in the following table. The types relevant to this transaction type are marked with an asterisk.

Lookup type[2:0]   Transaction definition
3'b000             32-bit memory transaction
3'b001             64-bit memory transaction
3'b010 *           ID-based transaction
3'b011             32-bit I/O transaction
3'b100             Routed to root complex
3'b101             Broadcast from root complex
3'b110             Terminate at receiver
3'b111             Reserved

• port_is_downstream - lets the Address Lookup Module know what the most likely lookup sequence should be (routed to root complex).

• tc[2:0] - used to help determine the egress_qid[3:0] for this transaction.

• osd[3:0] - this is used in conjunction with the tc[2:0] field to determine the egress_qid[3:0].

Fast Path

If the Address Lookup FIFO is empty, the routing information is immediately presented to the Address Lookup Module in the Switch Core.

7. The Address Lookup Module (ALM) figures out the root port and which ports are connected to this ingress port. Then it begins walking the entries (up to 16) in the bus_number_lookup_table to find the base/limit pair that matches the address range, based on the array of valid ports to search. Note that the root entry is searched first if the 'port_is_downstream' bit is set. The ALM also determines the egress QID number (ranging from 0 to 15). If the ALM determines that a particular configuration request is targeted for the COP, it will notify the Address Lookup Interface that the TLP configuration type should be changed from type 1 to type 0 when the TLP is forwarded to the Data Mover.

2.3.3 Message Requests 

Each type of Message Request behaves differently regarding data flow. The following sections describe the various types of Message Requests and how each differs from the data flow example in Section 2.3.1.

2.3.3.1.1 INTx Interrupt Signaling Message Requests

The INTx virtual wire interrupt signaling mechanism is used to support legacy Endpoints 
in cases where the Message Signaled Interrupt mechanism cannot be used. All INTx 
messages are routed to the Root Complex. 

INTx Messages follow the same steps in the previous section except that steps 6 and 7 would be replaced with:

6. The Rx TLM stores the header address of each TLP in the Address Lookup FIFO. When the Address Lookup Interface is ready to present a transaction to the Address Lookup Module in the Switch Core, it uses the header address stored in the Address Lookup FIFO to access the routing information (in this case, the bus number) from the Header Buffer. The following information is passed into the Lookup Module:

• address[43:0] - contains the 8-bit bus number.

• lookup_type[2:0] - field is used to specify the transaction type as shown in the following table. The types relevant to this transaction type are marked with an asterisk.

Lookup type[2:0]   Transaction definition
3'b000             32-bit memory transaction
3'b001             64-bit memory transaction
3'b010             ID-based transaction
3'b011             32-bit I/O transaction
3'b100 *           Routed to root complex
3'b101             Broadcast from root complex
3'b110             Terminate at receiver
3'b111             Reserved

• port_is_downstream - lets the Address Lookup Module know what the most likely lookup sequence should be (routed to root complex).

• tc[2:0] - used to help determine the egress_qid[3:0] for this transaction.

• osd[3:0] - this is used in conjunction with the tc[2:0] field to determine the egress_qid[3:0].

Fast Path

If the Address Lookup FIFO is empty, the routing information is immediately presented to the Address Lookup Module in the Switch Core.

7. In this case, the Address Lookup Module (ALM) knows that the TLP is going to a Root Complex, so it only has to figure out the root port. Then it uses the device number to determine the mapping of the INTx virtual wire on the primary side of the bridge. The ALM returns int_map[1:0] to the Address Lookup Interface which stores it in the Header Info Code in the Header Buffer so that the Packet Generator will know to overwrite the Code in the TLP with the Code provided in the Header Info Code.
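
For illustration, the conventional PCI bridge interrupt mapping rotates the INTx wire by the device number, modulo 4. The sketch below applies that conventional rule. Whether the ALM computes int_map[1:0] in exactly this way is an assumption, since the text only says that the device number determines the mapping; the function name and the 0..3 encoding of INTA..INTD are likewise hypothetical.

```python
INTX_NAMES = ("INTA", "INTB", "INTC", "INTD")


def int_map(device_number: int, intx: int) -> int:
    """Map a secondary-side INTx wire to the primary side of a bridge.

    device_number: device number of the message source (0..31)
    intx: 0 for INTA, 1 for INTB, 2 for INTC, 3 for INTD
    Returns the 2-bit code of the virtual wire seen on the primary side.
    """
    if not 0 <= device_number <= 31:
        raise ValueError("device number is a 5-bit field")
    if not 0 <= intx <= 3:
        raise ValueError("INTx code is a 2-bit field")
    # Conventional bridge swizzle: rotate by the device number.
    return (device_number + intx) % 4
```

Under this rule device 0 passes INTA through unchanged, while device 1 presents its INTA as INTB on the primary side.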
2.3.3.1.2 Power Management Message Requests

There are two Power Management Messages that require special handling in the Address 
Lookup Interface in the Rx TLM. They are PME_Turn_Off and PME_TO_Ack. 



2.3.3.1.3 PME_Turn_Off

PME_Turn_Off is generated by a root complex to notify all of its downstream ports to
prepare for power removal. The PME_Turn_Off Message has a routing type of 3'b101, which
is 'broadcast from root complex'. In this case, steps 6 and 7 are replaced by the
following:

6. The Rx TLM stores the header address of each TLP in the Address Lookup FIFO.
When the Address Lookup Interface is ready to present a transaction to the
Address Lookup Module in the Switch Core, it uses the header address stored in
the Address Lookup FIFO to access the routing information (including the bus
number) from the Header Buffer. The following information is passed into the
Lookup Module:

• address[43:0] - contains the 8-bit bus number.

• lookup_type[2:0] - field is used to specify the transaction type as shown
in the following table. The types relative to this transaction type are:

Lookup type[2:0]    Transaction definition
3'b000              32-bit memory transaction
3'b001              64-bit memory transaction
3'b010              ID-based transaction
3'b011              32-bit I/O transaction
3'b100              Routed to root complex
3'b101              Broadcast from root complex
3'b110              Terminate at receiver
3'b111              Reserved

• port_is_downstream - lets the Address Lookup Module know what the
most likely lookup sequence should be (routed to root complex).

• tc[2:0] - used to help determine the egress_qid[3:0] for this transaction.
• osd[3:0] - this is used in conjunction with the tc[2:0] field to determine
the egress_qid[3:0].



Fast Path 

If the Address Lookup FIFO is empty, the routing information is immediately 
presented to the Address Lookup Module in the Switch Core. 



7. In this case, the Address Lookup Module (ALM) knows that the TLP needs to be
broadcast to all downstream ports configured to the Root Complex, so it looks up
the endpoints that the TLP should be sent to by asserting the corresponding bits in
broadcast_ports[15:0]. The Address Lookup Interface then submits a TLP to the
Transaction Scheduler for each downstream port designated in
broadcast_ports[15:0], along with sending it to the COP (port 16). The Address
Lookup Interface will then halt all forward progress until it has been notified by
the COP that the scoreboard is set up and it is ready to receive all the
PME_TO_Ack Messages from each downstream port.
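
The broadcast expansion in step 7 can be sketched as follows. This is only an illustration: the names LookupType, COP_PORT, and broadcast_targets are hypothetical, and the sketch assumes that bit n of broadcast_ports[15:0] selects port n, which the text does not state. The 3'b100 encoding is inferred from the position of the row in the table.

```python
from enum import IntEnum

COP_PORT = 16  # the text sends the COP its copy of the broadcast on port 16


class LookupType(IntEnum):
    """lookup_type[2:0] encodings as listed in the step 6 table."""
    MEM32 = 0b000
    MEM64 = 0b001
    ID_BASED = 0b010
    IO32 = 0b011
    ROUTED_TO_ROOT = 0b100          # assumed: the code is illegible in the source table
    BROADCAST_FROM_ROOT = 0b101
    TERMINATE_AT_RECEIVER = 0b110
    RESERVED = 0b111


def broadcast_targets(broadcast_ports: int) -> list[int]:
    """Expand a 16-bit port mask into egress ports, then append the COP port.

    Assumption: bit n of the mask corresponds to port n.
    """
    if not 0 <= broadcast_ports <= 0xFFFF:
        raise ValueError("broadcast_ports is a 16-bit mask")
    ports = [p for p in range(16) if broadcast_ports & (1 << p)]
    return ports + [COP_PORT]
```

For a Root Complex whose downstream ports are 4, 6, and 7, the mask 0b0000000011010000 would yield one TLP for each of those ports plus one for the COP.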



2.3.3.1.4 PME_TO_Ack 

The only thing to note for PME_TO_Ack Messages is that whenever an upstream port sends
one, it should be routed to the COP. The COP will keep a
scoreboard of all the endpoint ports that received a PME_Turn_Off Message as they
transmit a PME_TO_Ack message. When all of the endpoint ports have transmitted a
PME_TO_Ack message (or the timer expires), the COP will generate one PME_TO_Ack
Message to the Root Complex.
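
A minimal sketch of the COP scoreboard described above, assuming a simple tick-based timer. The class and method names are hypothetical, and the timer granularity is not specified in this document.

```python
class PmeTurnOffScoreboard:
    """Tracks which endpoint ports still owe a PME_TO_Ack for one Root Complex."""

    def __init__(self, broadcast_ports, timeout_ticks):
        self.pending = set(broadcast_ports)  # ports that received PME_Turn_Off
        self.ticks_left = timeout_ticks      # assumed timer model: counts down per tick
        self.done = False

    def on_ack(self, port):
        """Record a PME_TO_Ack; return True when the single upstream ack is due."""
        self.pending.discard(port)
        return self._check()

    def on_tick(self):
        """Advance the timer; return True when it expires before all acks arrive."""
        self.ticks_left -= 1
        return self._check()

    def _check(self):
        if self.done:
            return False  # the aggregated PME_TO_Ack is generated only once
        if not self.pending or self.ticks_left <= 0:
            self.done = True
            return True
        return False
```

The caller would generate the one PME_TO_Ack Message to the Root Complex the first time either method returns True.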

2.3.3.1.5 Locked Transaction Message Requests 

Whenever a particular root requests a locked transaction, other sources going to that
output will be halted. When the CplLk is received from the downstream port, all other
upstream queues going to the root are locked until the Unlock message is received.




2.4 Shared Link Descriptions 
2.4.1 AS Encapsulation 

AS header encapsulation only pertains to Shared Ports. Non-shared Ports have no
knowledge of "Shared I/O". On the Rx MAC, the Transaction Layer Module strips the
AS Header from all TLPs before it stores the packet in the in-line buffer. It uses the AS
Header to determine the OS Domain associated with the given TLP. On the Tx MAC,
the Transaction Layer Module inserts the AS Header to the given TLP. The OS Domain
that is inserted as part of the AS Header is reported by the Switch Core and passed on to
the Tx Transaction Layer Module.

The AS Header is described in detail in the following section. Figure 2.4-2 describes the
format of the Header.




2.4.2 OS Domain Routing

For our switch, ports can be shareable ports, which means multiple different CPUs can 
address resources over the same PCI-Express link. A maximum of 16 OS Domains (or 
CPUs) will be supported in this implementation, with each port having the capability to 
send and receive from 16 OS Domains, across all 8 VCs possible in the PCI Express 
spec. 



The PCI-Express Advanced Switching (AS) spec incorporates an 8-byte AS header that is
inserted into the transaction. Our switch will use this header to specify the OS Domain
associated with a given transaction. The following diagram shows where this header is
located in the packet.

[Diagram: packet layering showing the AS Header together with the Transaction Layer,
Data Layer, and Physical Layer]

Figure 2.4-1:

AS headers specify an 8-bit field, called the PI (Protocol Interface) field, that defines
what type of AS packet is contained in the payload. We will use a number
ranging from 224 through 254 as a vendor-defined PI. The only piece of
information in our AS header will be the 8-bit OS Domain. The proposed AS header
is shown below.




[Diagram: AS header layout over Byte 0 and Byte 4, bits 7 to 0 of byte offsets +0 through
+3, with fields R, RNP, RN, OSD, and PI]

Figure 2.4-2:

PI - Protocol Identifier
OSD - OS Domain number

RNP - Resource Number Present (When high, the RN field is valid; when low, it is
invalid and must be 0's)

RN - Resource Number (which buffer this packet belongs to)
R - reserved

On shared ports, the RX MAC will first check the PI field to make sure this type of AS
packet is understood by our switch. We'll have a register within the RX MAC defined
that contains the allowable encoding for this type of AS packet. At first, this will likely
be our selected vendor-defined PI number. If our technology is adopted by the SIG, a
standard number will be defined, and this register will then be programmed to contain
this standard value for use with future devices.



This AS header allows our switch to map the incoming value of the OS Domain field 
(local to the I/O device or inter-switch port) in the AS header to the global OS Domain 
number within the switch (one of 16 values, 0 through 15 for the first revision of our 
switch). Any packets received on the shared port that use a different PI type will be 
discarded.

For upstream ports, the RX MAC will only need to associate the OSD number with a
4-bit register number loaded at config time. This number will be the OSD field that is
passed to the lookup_unit to match the correct set of P2P bridge headers to compare for a
lookup match.
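
The PI check and OS Domain mapping on a shared port can be sketched as follows. This is an illustration under stated assumptions: SharedPortRx and its fields are hypothetical names, the mapping table stands in for whatever configuration registers the RX MAC actually uses, and treating an unmapped local OSD as a discard is an assumption that the text does not address.

```python
from typing import Optional

NUM_GLOBAL_OSDS = 16  # first revision of the switch: global OSD 0 through 15


class SharedPortRx:
    """Models the PI filter and local-to-global OSD mapping of a shared-port RX MAC."""

    def __init__(self, allowed_pi: int, osd_map: dict[int, int]):
        if not 0 <= allowed_pi <= 0xFF:
            raise ValueError("PI is an 8-bit field")
        for global_osd in osd_map.values():
            if not 0 <= global_osd < NUM_GLOBAL_OSDS:
                raise ValueError("global OSD must be 0 through 15")
        self.allowed_pi = allowed_pi   # register holding the allowable PI encoding
        self.osd_map = dict(osd_map)   # local OSD (from AS header) -> global OSD

    def classify(self, pi: int, local_osd: int) -> Optional[int]:
        """Return the global OSD for a packet, or None if the packet is discarded."""
        if pi != self.allowed_pi:
            return None                      # different PI type: discard
        return self.osd_map.get(local_osd)   # assumed: unmapped OSDs are also discarded
```

With allowed_pi set to the vendor-defined value 224 and a map of {0: 3, 1: 7}, a packet carrying PI 224 and local OSD 1 would be handled as global OSD 7, while a packet with PI 225 would be dropped.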

2.4.3 Flow Control and Credits 

Flow Control is used to track the queue/buffer space available in the Agent across the
link. It is used to prevent overflow of receiver buffers. The Flow Control information is
conveyed between two sides of the Link using DLLP packets.

For Non-shared Endpoints (only one OSD), flow control updates follow the same format
as the PCI Express Base Specification, which is as follows:



[Diagram: DLLP layout over Byte 0 and Byte 4, bits 7 to 0 of byte offsets +0 through +3,
with fields P/NP/Cpl, 0, VC ID, R, HdrFC, R, DataFC, and 16bCRC]

Figure 2.4-3: DLLP format for Flow Control Packet for Non-Shared Ports

P/NP/Cpl: This field specifies the type of transaction that is being reported. P - Posted
Request; NP - Non-posted Request; Cpl - Completion.

VC ID: This field specifies the Virtual Channel that is being reported.

R: Reserved

HdrFC: This field contains the credit value for Headers of the indicated type. One credit
value for headers is one maximum size header plus TLP digest.

DataFC: This field contains the credit value for payload data of the indicated type. One
credit value is equivalent to 16 bytes of data.

16bCRC: This field contains the calculated CRC value of all bits of the packet using the
polynomial coefficient of 100Bh.

For Shared Endpoints, flow control updates are advertised using the following DLLP to
account for the OSD:



[Diagram: DLLP layout over Byte 0 and Byte 4, bits 7 to 0 of byte offsets +0 through +3,
with fields Type (1011 V3 V2 V1 V0), TT, R, OSD, CT, Credit count, and 16bCRC]

Figure 2.4-4: DLLP format for Flow Control Packet for Shared Ports



• Type: Upper nibble set to 1011 for an FC Update shared-link DLLP. The lower
nibble specifies the VC number.

• TT: Transaction Type (00 for Posted, 01 for Non-posted, 10 for Completions)

• R: Reserved

• OSD: OSD number

• CT: Credit Type (0 for header credits, 1 for data credits)

• Credit count: Contains either the 12-bit data credit count or the 8-bit (upper 4 bits
are zeros) header credit count, based on the value of the CT bit

• 16bCRC: This field contains the calculated CRC value of all bits of the packet
using the polynomial coefficient of 100Bh.
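
As a small illustration of the field rules above, the following sketch builds the Type byte and validates the Credit count field. The function names are hypothetical. The exact bit positions of the remaining fields within the DLLP are not recoverable from this document, so the sketch does not attempt to pack a complete packet.

```python
def type_byte(vc: int) -> int:
    """Type byte of a shared-link FC Update DLLP: upper nibble 1011, lower nibble VC."""
    if not 0 <= vc <= 0xF:
        raise ValueError("VC number must fit in the lower nibble")
    return (0b1011 << 4) | vc


def encode_credit_count(ct: int, count: int) -> int:
    """Return the 12-bit Credit count field.

    ct == 0: header credits, 8 significant bits, upper 4 bits zero.
    ct == 1: data credits, all 12 bits significant.
    """
    if ct not in (0, 1):
        raise ValueError("CT is a single bit")
    limit = 0xFF if ct == 0 else 0xFFF
    if not 0 <= count <= limit:
        raise ValueError("credit count out of range for this credit type")
    return count & 0xFFF
```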

2.4.3.1.1 Receiver Flow Control 

The RxTLM keeps track of flow control accounting functions of its buffers. This
information is assembled into a FCUpdate DLLP and forwarded to the Tx Transaction
Layer Module where it is scheduled to be transmitted to the Agent across the link.

For each type of information tracked, the following quantities are calculated for flow
control TLP Receiver accounting (for non-shared ports, these calculations are performed
for each VC, and for shared ports these calculations are performed for each VC/OSD
group):



• CREDITS_ALLOCATED - The total number of credits granted to the
Transmitter since initialization, modulo 2^[Field Size] (where [Field Size] is 8 for
headers and 12 for payload data).

• CREDITS_RECEIVED - The total number of FC units consumed by valid TLPs
received since flow control initialization, modulo 2^[Field Size] (where [Field Size] is
8 for headers and 12 for payload data).

The RxTLM will also check for buffer overruns. This is done by checking the following
equation:

(CREDITS_ALLOCATED - CREDITS_RECEIVED) modulo 2^[Field Size] > 2^[Field Size] / 2

The scheduling of transmission of UpdateFC DLLPs will obey the following rules:

• In the L0 or L0s Link state, UpdateFC DLLPs will be scheduled for
transmission once every 30us or 120us, depending on the status of the Extended
Sync bit in the Control Link Register.

• A Timer will also be implemented with the following rules:

■ The Timer is active only when the Link is in the L0 or L0s Link
state.

■ The Timer has a limit of 200us.

■ The receipt of any Init or Update FC DLLP resets the Timer.

■ Upon Timer expiration, the Physical Layer will be instructed to
retrain the Link.

Otherwise, for all types of transactions that do not have infinite credit, a Flow Control
DLLP will be scheduled for transmission after a valid TLP is received and stored, or
when one unit is made available by TLPs processed.
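
The receiver accounting and overrun check can be sketched as below. The class and method names are hypothetical, and the sketch keeps one counter pair per tracked credit type (per VC, or per VC/OSD group on shared ports). The comparison operator follows the equation as printed in this document.

```python
class RxCreditCounter:
    """Receiver-side accounting for one credit type (header or data)."""

    def __init__(self, field_size: int, initial_credits: int):
        self.mod = 1 << field_size               # 2^[Field Size]: 8 for headers, 12 for data
        self.allocated = initial_credits % self.mod  # CREDITS_ALLOCATED
        self.received = 0                            # CREDITS_RECEIVED

    def on_buffer_freed(self, units: int) -> None:
        """Grant more credits to the Transmitter as buffer space is released."""
        self.allocated = (self.allocated + units) % self.mod

    def on_tlp_received(self, units: int) -> bool:
        """Count a valid TLP; return True if the overrun check trips."""
        self.received = (self.received + units) % self.mod
        return self.overrun()

    def overrun(self) -> bool:
        # (CREDITS_ALLOCATED - CREDITS_RECEIVED) mod 2^[Field Size] > 2^[Field Size] / 2
        return (self.allocated - self.received) % self.mod > self.mod // 2
```

Because both counters wrap at 2^[Field Size], a transmitter that sends one unit more than it was granted makes the difference wrap to a large value, which is what the half-range comparison detects.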
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2.4.3.1.2 Transmitter Flow Control 

The Transaction Scheduler receives the latest FC updates from all the Tx MACs. These
FC updates report the most recent number of FC units advertised by the receiver on the
other side of the link, called CREDIT_LIMIT. CREDIT_LIMIT is used to determine if
a transaction being transferred to a particular Tx MAC has enough FC credits to be
forwarded to the appropriate Tx MAC.



For each type of information tracked, the following quantities are calculated for flow 
control gating: 

• CREDITS_CONSUMED - The total number of FC units consumed by TLP
transmissions made since flow control initialization, modulo 2^[Field Size] (where
[Field Size] is 8 for headers and 12 for payload data).

• CREDIT_LIMIT - The most recent number of FC units legally advertised by the
Receiver. This quantity represents the total number of FC credits made available
by the Receiver since flow control initialization, modulo 2^[Field Size] (where [Field
Size] is 8 for headers and 12 for payload data).

To determine if there is enough credit for the current transaction, the following equation
is evaluated:



(CREDIT_LIMIT - (CREDITS_CONSUMED + credits required by the current transaction)) modulo
2^[Field Size] < 2^[Field Size] / 2




Even though the Transaction Scheduler determines Flow Control gating of transactions, it
does not have any knowledge of the Error Status of the transaction. It is not until the
Transaction Scheduler forwards the TLP to the Tx MAC that it is known if the packet is
in error. Therefore, the Tx MAC is responsible for notifying the Transaction Scheduler
that the current TLP is in error so that it does not affect CREDITS_CONSUMED.
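
A minimal sketch of transmitter-side gating, including the error notification from the Tx MAC. The names are hypothetical. The gating equation is partly illegible in this document, so the sketch assumes that the credits required by the current transaction are added to CREDITS_CONSUMED before the comparison.

```python
class TxCreditGate:
    """Transmitter-side gating for one credit type toward one Tx MAC."""

    def __init__(self, field_size: int):
        self.mod = 1 << field_size  # 2^[Field Size]: 8 for headers, 12 for data
        self.consumed = 0           # CREDITS_CONSUMED
        self.limit = 0              # CREDIT_LIMIT

    def on_fc_update(self, credit_limit: int) -> None:
        """Latch the most recent CREDIT_LIMIT advertised by the Receiver."""
        self.limit = credit_limit % self.mod

    def has_credit(self, required: int) -> bool:
        """Gate check for a transaction needing `required` FC units (assumed form)."""
        cumulative = (self.consumed + required) % self.mod
        return (self.limit - cumulative) % self.mod < self.mod // 2

    def on_forwarded(self, required: int, in_error: bool = False) -> None:
        """Account for a forwarded TLP unless the Tx MAC reports it in error."""
        if not in_error:
            self.consumed = (self.consumed + required) % self.mod
```

The in_error flag models the notification described above: a TLP that the Tx MAC flags as being in error leaves CREDITS_CONSUMED unchanged.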

2.4.4 Reset

There are two types of Reset on the chip - Fundamental Reset and Hot Reset. The
following diagram will be used throughout the document to describe the devices affected
when any type of Reset is asserted at a Root Complex, a Root Port, the Switch, a
Downstream Port, or an Endpoint (I/O device) attached to a Downstream Port. Ports 1, 2,
3, 9 and 10 are all attached to root complexes and therefore each represent one OS domain (for
simplicity, the OS domain number will directly correlate to the port number, i.e. port 1 is 
assigned OS domain 1 in this example). Switch #1 has been configured such that port 4 
is shared by OS domains 1 and 3, port 5 is only accessed by OS Domain 2, port 6 is 
shared by OS domains 1, 2, and 3, and port 7 is shared by OS domains 1 and 2. 
Downstream port 4 in Switch #1 is connected to Root Port 8 in Switch #2, which enables 
access to the endpoints on Switch #2 from the Root Complexes in Switch #1.

Figure 2.4-5: Example Topology

2.4.4.1.1 Fundamental Reset

Fundamental Reset is an auxiliary signal provided by the system to a component or add-in
card. The signal must be called PERST#. Fundamental Reset can be asserted on the
Root Complexes, Endpoints (I/O Devices), or the Switch.

When Fundamental Reset is asserted on the Endpoints or the Switch, the behavior is
identical to what is described in the PCI Express Base Specification - all of the Links 
attached to the device being reset will be retrained, the state machines will be initialized, 
and all TLP information will be flushed. 



If Fundamental Reset is asserted on a Root Complex, not only does the Root Complex get 
reset, but all of its downstream ports must reset as well. If its downstream ports are 
"shared" with other Root Complexes, it is important to be able to reset only the part of 
the downstream port that pertains to the Root Complex being reset, and to leave the rest 
of the downstream port logic unchanged. This is done by generating and transmitting
vendor-specific DLLPs called Reset DLLPs that inform the two components on the link
which OS Domain is getting reset (this is explained further in the following sections). If
the downstream port is not shared by Root Complexes, the Link will reset according to
the PCI Express Base Specification. The following sections describe how the chip
behaves when a Fundamental Reset is asserted on the various parts of the chip.



2.4.4.1.2 Fundamental Reset initiated at the Switch

If the Fundamental Reset on the Switch is asserted, it will propagate the Reset to all
upstream and downstream ports. The devices attached to all the ports will be reset
according to the PCI Express Base Specification - all of the Links attached to the device
being reset will be retrained, the state machines will be initialized, and all TLP
information will be flushed. If the fabric topology involves more than one Switch, and a
Root Port in another Switch is affected by the reset, then all of the downstream ports
assigned to that Root Port are also reset. The components highlighted in yellow in Figure
2.4-6 depict the components that are affected when Switch #1 is reset. Everything inside
Switch #1 is reset, along with the devices attached to the ports. Note that in Switch #2,
only one of the root ports is affected by the reset. That operation is explained in detail in
Section 2.4.4.1.3.

Figure 2.4-6: Fundamental Reset initiated at Switch #1

2.4.4.1.3 Fundamental Reset initiated at the Root Complex

If the Fundamental Reset on a Root Complex is asserted, the reset must be propagated to 
all its downstream ports without corrupting the traffic from the other OS Domains (or 
Root Complexes). The following steps are taken when a Fundamental Reset is asserted 
at a Root Complex: 

At the Root Complex

1. All Port Registers and State Machines must be set to their initial values as
specified in the PCI Express Base Document.

2. The Root Complex will attempt to retrain the link.

3. Once both components on the link have entered the initial link training state, they
will proceed through Link Initialization and then through Flow Control
Initialization for VC0.

At the Switch

1. The Root Port connected to the Root Complex will retrain and initialize all its
state machines and registers.

2. The Root Port will notify the COP that all of its downstream ports need to be
reset.

3. The COP will pass the reset notification to all the downstream Ports.

4. All registers and state machines relevant to the OSD being reset must be set to
their initial values. All TLPs belonging to the OSD being reset must be flushed -
all TLPs stored in the Rx in-line buffers will naturally drain and all TLPs in the
Tx retry buffers will drain after Ack DLLPs are received. During reset, new TLPs
belonging to the OSD will be rejected. All other TLPs will be preserved.

5. The Downstream Port will be notified of the reset condition and the OSD that has
initiated the reset.

6a. If the Downstream Port is not shared (i.e. it is only accessed by one Root
Complex), the port is reset by attempting to retrain the Link.

6b. If the Downstream Port is shared, all TLPs pertaining to the OSD that has
initiated the reset must be flushed. Flow Control must also be updated to reflect
the flushing of TLPs. A vendor-specific DLLP is generated called a Reset DLLP
by the Downstream Port. The Reset DLLP contains the OSD that initiated the
reset and is transmitted on the link. A Reset DLLP is transmitted every time the
Transaction Arbiter selects the OSD that initiated the reset. Otherwise, the
Transaction Arbiter schedules TLPs to transmit on the other OSDs that are
operating normally. The Downstream Port will continue to transmit Reset DLLPs
until the reset notification from the COP has been removed.



At the Endpoint 

1a. If the Endpoint is not shared (i.e. it is only accessed by one Root Complex), the
port is reset by attempting to retrain the Link.

1b. If the Endpoint is shared, it will receive the Reset DLLPs and clear all registers,
state machines, and flush all TLPs that pertain to the OSD initiating the reset.
The device will stay in this reset mode until it stops receiving Reset DLLPs. All
traffic related to the OSDs that are not being reset will operate normally.
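
The choice made in steps 6a and 6b can be sketched as a small decision function. The function name and the returned action strings are hypothetical; the port-to-OSD assignments in the usage note are taken from the example topology described in section 2.4.4.

```python
def downstream_port_reset_action(port_osds: set[int], reset_osd: int) -> str:
    """Decide how a Downstream Port reacts when the COP reports `reset_osd` in reset.

    port_osds is the set of OS Domains configured to access the port.
    """
    if reset_osd not in port_osds:
        return "none"             # port does not serve this OSD: unaffected
    if len(port_osds) == 1:
        return "retrain_link"     # step 6a: not shared, reset the whole port
    return "send_reset_dllp"      # step 6b: shared, reset only this OSD's logic
```

In the example topology, resetting OS domain 1 would give "send_reset_dllp" for ports 4, 6, and 7, and "none" for port 5, which is only accessed by OS domain 2.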

The components highlighted in yellow in Figure 2.4-7 depict the components that are
affected when the Root Complex attached to Root Port #1 is reset. Since downstream
ports 4, 6, and 7 are shared ports, only the logic pertaining to OS domain 1 should be
affected by the reset. On the other hand, port 14 in Switch #2 is only accessed by the OS
Domain being reset and can therefore reset the entire port by retraining the Link (instead
of sending Reset DLLPs specific to an OS domain).

Figure 2.4-7: Fundamental Reset initiated at a Root Complex



2.4.4.1.4 Fundamental Reset initiated at an Endpoint 

If the Fundamental Reset on an endpoint device is asserted, the device will simply reset
with its link according to the PCI Express Base Specification - all of the Links attached to
the device being reset will be retrained, the state machines will be initialized, and all TLP
information will be flushed. All other Ports on the Switch will not be affected.

The components highlighted in yellow in Figure 2.4-8 depict the components that are
affected when the endpoint connected to downstream port 13 is reset. It simply attempts
to retrain with the port on the other side of the Link. Any transaction being received by
the Tx MAC from the Switch Core will be discarded and will never reach the Data Link
Layer Module. The Root Complex will eventually time out when it never receives a
completion for a particular request. (We could also let the COP generate UR completions
since it will know which endpoints are in reset. The ALM could keep track of which
transactions are going to egress ports that are in reset and then route the packet to the
COP.)

Figure 2.4-8: Fundamental Reset initiated at an Endpoint

2.4.4.1.5 Hot Reset

Hot Reset is an in-band mechanism for propagating reset across a link. A Link can enter 
Hot Reset if directed by a higher layer, or if it receives two consecutive TS1 ordered sets 
with the Hot Reset bit asserted. The following sections describe how the chip behaves 
when a Hot Reset is asserted on the various parts of the chip. 



2.4.4.1.6 Hot Reset initiated at the Root Port

A Root Port can enter Hot Reset by having its Secondary Bus Reset bit set in the Bridge
Control Register, or by receiving two consecutive TS1s with the Hot Reset bit set on the
Link. If the Hot Reset on a Root Port is initiated, the reset must be propagated to all its
downstream ports without corrupting the traffic from the other OS domains (or Root
Complexes). The following steps are taken when a Hot Reset is initiated at a Root Port:

At the Root Complex

1. The Root Complex receives a TS1 sequence with the Hot Reset bit asserted and
will attempt to retrain the Link. This can happen if the secondary bus reset bit is
set in the P2P config space.

2. All Port Registers and State Machines must be set to their initial values as
specified in the PCI Express Base Document.

3. Once both components on the link have entered the initial link training state, they
will proceed through Link Initialization and then through Flow Control
Initialization for VC0.

At the Switch

1. The Root Port connected to the Root Complex will retrain and initialize all its
state machines and registers.

2. The Upstream Port will notify the COP that all of its downstream ports need to be
reset.

3. The COP will pass the reset notification to all the downstream Ports.

4. All registers and state machines relevant to the OSD being reset must be set to
their initial values. All TLPs belonging to the OSD being reset must be flushed -
all TLPs stored in the Rx in-line buffers will naturally drain and all TLPs in the
Tx retry buffers will drain after Ack DLLPs are received. During reset, new TLPs
belonging to the OSD will be rejected. All other TLPs will be preserved.

5. The Downstream Port will be notified of the reset condition and the OSD that has
initiated the reset.

6a. If the Downstream Port is not shared (i.e. it is only accessed by one Root
Complex), the port is reset by attempting to retrain the Link.

6b. If the Downstream Port is shared, all TLPs pertaining to the OSD that has
initiated the reset must be flushed. Flow Control must also be updated to reflect
the flushing of TLPs. A vendor-specific DLLP is generated called a Reset DLLP
by the Downstream Port. The Reset DLLP contains the OSD that initiated the
reset and is transmitted on the link. A Reset DLLP is transmitted every time the
Transaction Arbiter selects the OSD that initiated the reset. Otherwise, the
Transaction Arbiter schedules TLPs to transmit on the other OSDs that are
operating normally. The Downstream Port will continue to transmit Reset DLLPs
until the reset notification from the COP has been removed.

At the Endpoint

1a. If the Endpoint is not shared (i.e. it is only accessed by one Root Complex), the
port is reset by attempting to retrain the Link.

1b. If the Endpoint is shared, it will receive the Reset DLLPs and clear all registers,
state machines, and flush all TLPs that pertain to the OSD initiating the reset.
The device will stay in this reset mode until it stops receiving Reset DLLPs. All
traffic related to the OSDs that are not being reset will operate normally.

2.4.4.1.7 Hot Reset initiated at the COP

A Hot Reset can be initiated at the Switch through a set of Registers that can be accessed
via an I2C interface. The Hot Reset can be programmed such that it can operate on a per
port and/or per OSD basis. If the Hot Reset is propagated to Ports without distinguishing
between OSDs, the Ports will be reset according to the PCI Express Base Specification -
all of the Links attached to the device being reset will be retrained, the state machines
will be initialized, and all TLP information will be flushed. If the Hot Reset is
propagated to a subset of the OS domains on a particular Port, the Port will use Reset
DLLPs to reset the designated part of its port logic pertaining to the OS domain specified
in the Reset DLLP.




2.4.4.1.8 Hot Reset initiated at a Downstream Port

A Downstream Port can enter Hot Reset by having its Secondary Bus Reset bit set in the
Bridge Control Register. The Port will transmit Reset DLLPs on the OSDs that
correspond to the Secondary Bus Reset bits that were asserted and clear all Register and
State Machines pertaining to them. All traffic related to the OSDs that are not being
reset will operate normally.

2.4.4.1.9 Hot Reset initiated at a Shared Upstream Port

A Shared Upstream Port can enter Hot Reset by having its Secondary Bus Reset bit set in
its Bridge Control Register. The Port will transmit Reset DLLPs on the OSDs that
correspond to the Secondary Bus Reset bits that were asserted and clear all Register and
State Machines pertaining to the OSD. All traffic related to the OSDs that are not being
reset will operate normally.

2.4.4.1.10 Hot Reset initiated at an I/O Device

If a device wants to reset itself, it can do so by either transmitting Reset DLLPs or by
transmitting TS1s. If the device wishes to only reset a particular OSD, it will generate
and transmit Reset DLLPs that specify the OSD to reset. It will also clear all Registers
and State Machines pertaining to the OSD. If the device wishes to reset the entire link, it
will generate and transmit TS1s with the Hot Reset bit asserted. It
will also initialize all Registers and State Machines and attempt to bring up the link.

2.4.5 Power Management 

PCI Express Power Management provides the following services:

• It allows software driven D-state transitions to change the Link power
management states for a physical Link.

• It provides a hardware-autonomous capability to change the Link power
management states for a physical Link (Active State Power Management).

• It provides a wakeup mechanism driven by in-band TLPs routed from the
requesting device towards the Root Complex - these are called power
management event (PME) Messages.

• It provides a means to change the Power Management state by generating PMEs
per PCI Express function.



2.4.5.1.1 Link State Power Management

A PCI Express physical Link can enter Link power management states by either software
driven D-state transitions or by active state Link power management capabilities. The defined
Link states include L0, L0s, L1, L2, and L3. The power saving increases as the Link
state transitions from L0 through L3. Table 2.4-1 and Table 2.4-2 summarize the Link
Power Management States for both non-shared I/O and shared I/O.

\ 





L-State        Description                       Used by S/W PM?   Used by ASPM?   Clocks & Power
L0             Fully Active Link                                   Yes (D0)        On
L0s            Standby State                                       Yes (D0)        On
L1             Lower Power Standby                                 No              On
L2/L3 Ready    Staging point for Power removal   Yes               No              On
L3             Off                               n/a               n/a             Off

Table 2.4-1: Summary of Non-Shared I/O Link Power Management States




L-State   Description          Used by S/W PM?   Used by ASPM?   Clocks & Power   Shared I/O
L0        Fully Active Link    Yes (D0)          Yes (D0)        On               Normal Operation
L0s       Standby State        No                Yes (D0)        On               TLP & DLLP transmission is prohibited for all OSDs (ASPM only)
L1-L3     Lower Power States   No                No              On               TLP & DLLP transmission is prohibited for a specific OSD

Table 2.4-2: Summary of Shared I/O Link Power Management States



NextIO, Inc. Confidential 
Property of NextIO, Inc. 






NextIO, Inc. 

© All Rights Reserved. 



NEXSIS Overview Document 

V0.8 



2.4.5.1.2 Power Management Software Control 

One of the ways that power management states of a Link are determined is by the 
software driven D-state of its downstream component. Table 2.4-3 depicts the 
relationship between the power state of a component and its Upstream Link. A 
Downstream component can be an Endpoint or another Switch. 



Downstream           Permissible Upstream   Permissible Interconnect   Permissible Interconnect
Component D-State    Component D-State      State (Non-Shared I/O)     State (Shared I/O)
D0                   D0                     L0, L0s                    L0, L0s
D1 (optional)        D0-D1                  L1                         L0, L0s (Cannot go into a low level power state)
D2 (optional)        D0-D2                  L1                         L0, L0s (Cannot go into a low level power state)
D3hot                D0-D3hot               L1, L2/L3 Ready            L0, L0s (Cannot go into a low level power state)
D3cold               D0-D3cold              L3                         L0, L0s (Cannot go into a low level power state)

Table 2.4-3: Relation between Power Management States of Link and Components
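
Table 2.4-3 can be read as a lookup from the downstream component's D-state to the permissible Link states. The sketch below is illustrative only: the dictionary and function names are hypothetical, and the non-shared entry for D3cold is partly illegible in the source, so only L3 is listed.

```python
# Keys are downstream component D-states; values are permissible Link states.
PERMISSIBLE_LINK_STATES = {
    "D0":     {"non_shared": ("L0", "L0s"),         "shared": ("L0", "L0s")},
    "D1":     {"non_shared": ("L1",),               "shared": ("L0", "L0s")},
    "D2":     {"non_shared": ("L1",),               "shared": ("L0", "L0s")},
    "D3hot":  {"non_shared": ("L1", "L2/L3 Ready"), "shared": ("L0", "L0s")},
    "D3cold": {"non_shared": ("L3",),               "shared": ("L0", "L0s")},  # partly illegible row
}


def permissible_link_states(d_state: str, shared: bool) -> tuple:
    """Return the Link states allowed for a downstream D-state on a given link kind."""
    kind = "shared" if shared else "non_shared"
    return PERMISSIBLE_LINK_STATES[d_state][kind]
```

The shared column is the same for every row because a shared link cannot go into a low level power state on behalf of a single downstream D-state.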
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Upstream Component 



• Upstream component sends configuration write request
• Upstream component blocks scheduling of new TLPs
• Upstream component receives acknowledgment for last TLP
• Upstream component sends PM_Request_Ack DLLP continuously until it sees
electrical idle
• Upstream component completes L1 transition; disables DLLP, TLP transmission and
brings Phy Layer to electrical idle

Downstream Component

• Downstream component begins L1 transition process
• Downstream component blocks scheduling of new TLPs
• Downstream component waits to receive Ack for last TLP
• PM_Enter_L1 DLLPs sent continuously
• Downstream component waits for PM_Request_Ack DLLP, acknowledging PM_Enter_L1
• Downstream component sees PM_Request_Ack DLLP, disables DLLP, TLP transmission and
brings Phy Layer to electrical idle

T - Transaction
D - Data Link
P - Physical

Figure 2.4-9: Entry into L1 Link State



2.4.5.1.3 Active State Power Management (ASPM)

Active State Power Management (ASPM) is an autonomous hardware based active state
mechanism that enables power saving even when the connected components are in the
D0 state (the fully operational state). After a period of idle Link time, the ASPM mechanism
engages in a Physical Layer protocol that places the idle Link into a lower power state.
Once in the lower power state, transitions to the fully operative L0 state are triggered by
traffic appearing on either side of the Link. This feature may be disabled by software.

Since ASPM is initiated by the link being in idle for a specified amount of time, the 
physical layer can be placed in a lower power state regardless of whether the component 
is shareable or not. When any traffic (regardless of OS domain) appears, the link is 
placed in the fully operative L0 State. 
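
The idle-driven behavior can be sketched with a simple tick counter. The names are hypothetical, the idle threshold is not specified in this document, and using L0s as the lower power state is an assumption based on Table 2.4-2.

```python
class AspmLink:
    """Idle-timer model of ASPM on one physical Link, shared or not."""

    def __init__(self, idle_threshold: int, enabled: bool = True):
        self.idle_threshold = idle_threshold  # assumed: idle time measured in ticks
        self.enabled = enabled                # software may disable ASPM
        self.idle_ticks = 0
        self.state = "L0"

    def tick(self, traffic_osds=()):
        """Advance one step; traffic_osds lists the OS domains with traffic this step."""
        if traffic_osds:
            # Any traffic, regardless of OS domain, returns the Link to L0.
            self.idle_ticks = 0
            self.state = "L0"
        else:
            self.idle_ticks += 1
            if self.enabled and self.idle_ticks >= self.idle_threshold:
                self.state = "L0s"  # assumed lower power state
        return self.state
```

The OS domain of the traffic is deliberately ignored, which mirrors the point that the physical layer decision does not depend on whether the component is shareable.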
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2.4.5.1.4 Power Management Event Mechanisms 

2.4.5.1.5 Link Wakeup 

PCI Express components are permitted to wakeup the system using a wakeup mechanism 
followed by a power management event (PME) Message. These PME Messages are in- 
band TLPs routed from the requesting device towards the Root Complex. This PME 
mechanism is broken up into two tasks: 

• Reactivation (wakeup) of the associated resources 

• Sending a PME Message to the Root Complex 



The Link wakeup mechanisms provide a means of signaling the platform to re-establish
power and reference clocks to the components within its domain. There are two defined
wakeup mechanisms: Beacon and WAKE#. The Beacon uses in-band signaling to
implement wakeup functionality. WAKE# is an input to the Switch, and in response to
WAKE# being asserted, the Switch must generate a Beacon that is propagated to the
Root Complex.

The Switch must translate the wakeup mechanism appropriately when some ports use the
Beacon mechanism and others use WAKE#. The COP will keep a scoreboard of the
downstream ports' wakeup states, and when all the downstream ports of a specific Root
Complex have been woken up, the COP will send either a Beacon or WAKE# to the Root
Complex.

Regardless of the wakeup mechanism used, once the Link has been re-activated and
trained, the requesting agent then propagates a PM_PME message upstream to the Root
Complex.

2.4.5.1.6 PME Messages 



PCI Express devices need to be notified before their reference clock and main power are
removed so that they can prepare for it. This is done as follows:

1. Before power and clocks are turned off, the Root Complex (or Downstream Port)
issues a PME_Turn_Off message to all agents downstream to cease initiation of
any subsequent PM_PME messages.

2. Each agent is required to respond with a PME_To_Ack TLP, which must
terminate at the point of origin.

3. As each agent responds with a PME_To_Ack TLP, the TLP is received by the
endpoint port and routed to the COP. When the COP receives PME_To_Ack
TLPs for all of a particular Root Complex's downstream ports, the COP
generates and sends a PME_To_Ack TLP to the Root Port.

4. Once an endpoint port has sent the PME_To_Ack packet, it must then prepare for
removal of power and clocks by initiating a transition to the L2/L3 Ready state.

5. The Switch is responsible for making sure that the upstream port goes to L2/L3
Ready state after all its downstream ports have entered L2/L3 Ready state. It
should not wait indefinitely for the PME_To_Ack packet, but should implement
a timeout mechanism where it would assume that the PME_Turn_Off TLP was
received after the timeout expired.

6. The Power Delivery Manager must wait a minimum of 100 ns after the Root
Complex has transitioned to L2/L3 Ready before removing power and clocks.
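
The aggregation in steps 3 and 5 can be pictured with a small software model. The following is only a sketch: the class name, the port numbers, and the timeout value are hypothetical and do not come from this document, which does not specify the length of the timeout.

```python
class PmeTurnOffScoreboard:
    """Tracks PME_To_Ack TLPs for one Root Complex's downstream ports.

    Hypothetical model of the COP behavior; names and values are assumptions.
    """

    def __init__(self, downstream_ports, timeout_ns):
        self.pending = set(downstream_ports)  # ports that have not acked yet
        self.timeout_ns = timeout_ns          # assumed value, not from the spec
        self.elapsed_ns = 0

    def ack_received(self, port):
        # A PME_To_Ack TLP arrived from this downstream port.
        self.pending.discard(port)

    def tick(self, ns):
        # Advance the timeout timer.
        self.elapsed_ns += ns

    def ready_to_ack_upstream(self):
        # Forward one PME_To_Ack to the Root Port once every port has answered,
        # or once the timeout expires so a silent port cannot stall the sequence.
        return not self.pending or self.elapsed_ns >= self.timeout_ns


sb = PmeTurnOffScoreboard(downstream_ports=[8, 9, 10], timeout_ns=1_000_000)
sb.ack_received(8)
sb.ack_received(9)
print(sb.ready_to_ack_upstream())  # port 10 has not answered yet
sb.ack_received(10)
print(sb.ready_to_ack_upstream())  # all ports have answered
```

Keeping the pending ports in a set makes the "all acknowledged" test a simple emptiness check, and the timeout path covers step 5's requirement that the Switch not wait indefinitely.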



2.4.6 Switch initialization 

There are some basic events that must always happen, in order, whenever our device
initializes, from a software/configuration point of view.

1. I2C initialization 

2. Link training 

3. OSD negotiation 

4. System setup update (optional) 

5. FC initialization 

6. Pseudo-Device Discovery (optional) 

7. Device Discovery 

Each will be covered in more detail in the following sections.

2.4.6.1.1 I2C initialization

There will be hardware defaults for many of the registers in our chip that should provide
a functional chip at boot time. There are, however, a few structures that must be
provisioned by the I2C interface at boot time, since the hardware has no idea what type of
system is being created.

There will be a default mapping of OSDs to port set up by the I2C. This means the I2C
will set up the [illegible] so that transactions that enter the switch know
where to go for the various OSDs on that port. The following structures must be
provisioned with usable values by the I2C for the switch to boot:

• VC mapping table (indexed by osd[3:0]) in the MAC: set to all 0s except for
TC0, which is hardwired to 1 on VC0. Note that there is one of these tables per
port.

• Destination QID RAM in the Address_lookup_module (indexed by
{dest_port[4:0],osd[3:0],tc[2:0]}): All entries for dest_port=16 should be
provisioned to return some 5-bit number as the dest_qid for all OSD/TC
combinations the system expects to appear. Again, the valid bit should be set for
these entries. There is only one of these in the switch.




For example, if the EEPROM expects a 2-OSD device to be plugged into port 8, and this
device should talk to ports 0 and 1, the I2C will write the data in the Destination QID
RAM at address {5'd16,4'd0,3'd0} to a value of 0. It will also write the address at
{5'd16,4'd1,3'd0} with a value of 1. By writing these values (and setting the valid bit for
each address), initial transactions will be queued in the switch core and eventually routed
to the COP through the correct queue groups.
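
To make the addressing concrete, the sketch below packs the three index fields into a single RAM address and provisions the two entries from the example. It assumes Verilog-style concatenation with dest_port in the most significant bits; the function names and the dictionary standing in for the RAM are hypothetical.

```python
def qid_ram_address(dest_port, osd, tc):
    """Pack {dest_port[4:0], osd[3:0], tc[2:0]} into a 12-bit RAM index."""
    if not (0 <= dest_port < 32 and 0 <= osd < 16 and 0 <= tc < 8):
        raise ValueError("field out of range")
    return (dest_port << 7) | (osd << 3) | tc


qid_ram = {}  # stand-in for the Destination QID RAM (assumption)


def provision(dest_port, osd, tc, dest_qid):
    """Write a 5-bit dest_qid and set the valid bit for one entry."""
    qid_ram[qid_ram_address(dest_port, osd, tc)] = {
        "valid": 1,
        "dest_qid": dest_qid & 0x1F,
    }


# The two writes from the example: OSD 0 -> queue 0, OSD 1 -> queue 1.
provision(16, 0, 0, 0)
provision(16, 1, 0, 1)
for addr in sorted(qid_ram):
    print(hex(addr), qid_ram[addr])
```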

2.4.6.1.2 Link training 

This protocol should follow the base specification mechanism to allow a given port to
train as a 1x, 2x, 4x, or 8x link. Since our switch is actually composed of 8x and 4x PCI-
Express cores (an 8x core plus a 4x core will be contained in a MAC), the 8x core in each
MAC will attempt to train first. If it trains in anything less than 8x, it will "turn on" the 4x
core and allow it to attempt to train. If it trains to 8x, the 4x core will never be turned on,
since all 8 lanes are in use by the 8x core.

2.4.6.1.3 OSD negotiation 



In order to minimize the required console management software usage, our device will
support auto-negotiation of the number of OSDs that are present on a given shared I/O
port. The EEPROM will configure the allowable number of OSDs for each shared port
and will be loaded as the default configuration of the system.

Once our switch completes link training when a port is enabled (due to a plug-in card
being added or just coming out of reset), the system will begin the process of
figuring out what types of devices it is connected to on each port. A new procedure is
defined to support this, using a new DLLP that is shown here:



[Figure: OSD negotiation DLLP format. Rows Byte 0 and Byte 4, columns +0 to +3 (bits 7 to 0 each). Fields, in order: Type (0000 0001), PH, R, VN, R, OSD Cnt.]

Type - always set to 0000 0001 for an OSD negotiation DLLP

PH - Phase of OSD Negotiation (0 for InitOSD1 DLLPs, 1 for InitOSD2 DLLPs)

R - Reserved

VN - Version number (set to 00 for base OSD negotiation)

OSD Cnt - for InitOSD1 DLLPs, the number of OSDs present in the device; for
InitOSD2 DLLPs, the negotiated number of OSDs for the link



The protocol very closely resembles the base specification's method of FC initialization. 
For the "shared I/O base mode" negotiation, the VN field must be set to 00. This means 
that the OSD and VC will explicitly be used for all wireline communication between two 
devices. 

For "shared I/O extended mode", the VN field can be set to 01. This mode
means that the RN field will be used to explicitly map traffic onto buffer resources
between the link partners. This allows a larger number of OSDs to share a buffer if
desired.



1. The OSD state machine will begin transmitting the same packet, an InitOSD1
DLLP, over and over every clock cycle until it receives an InitOSD1 DLLP from
the link partner's OSD state machine. At that point, the device will check the
value of the "OSD Cnt" field received from the link partner and compare that to
the switch's number of OSDs (which was being advertised in the "OSD Cnt" field
of the packets it initiates). The lesser of the two numbers will be the negotiated
number of supported OSDs for that link. If the link partner never sends any
InitOSD1 DLLPs before a timer expires (3 us), it is a non-shared port and our switch
will proceed to normal flow control initialization.

2. At that point, the OSD state machine will continue sending continuous InitOSD1
DLLPs, but now it will put the newly negotiated value in the "OSD Cnt" field.
Once it receives a DLLP from the link partner with the same number in the "OSD
Cnt" field, the state machine moves on to step 3.

3. The OSD state machine now transmits the same packet, except it sends it as an
InitOSD2 DLLP by setting the PH bit in the DLLP. The fact that the state
machine is now sending this type of DLLP means that it understood the OSD
negotiation procedure from its link partner. A timer is also started at this point,
and if the timer expires (3 us) before an InitOSD2 is received from the link
partner, the state machine is reset and the process begins again. If the state
machine receives an InitOSD2 from its link partner, OSD negotiation is complete
and it stops transmitting InitOSD2 DLLPs.
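
The three steps above can be modeled as a small state machine. This is a minimal sketch rather than the hardware implementation: the class and state names are hypothetical, DLLPs are reduced to a (PH, OSD Cnt) tuple, the two link partners are stepped in lockstep, and the 4 ns cycle time is an assumption based on a 250 MHz clock.

```python
from enum import Enum

TIMEOUT_NS = 3000  # the 3 us timer from the text


class State(Enum):
    ADVERTISE = 1   # step 1: InitOSD1 carrying the local OSD count
    CONFIRM = 2     # step 2: InitOSD1 carrying the negotiated count
    INIT2 = 3       # step 3: InitOSD2 carrying the negotiated count
    DONE = 4
    NON_SHARED = 5  # partner never answered: use normal FC initialization


class OsdNegotiator:
    def __init__(self, local_osd_count):
        self.local = local_osd_count
        self.state = State.ADVERTISE
        self.negotiated = None
        self.timer_ns = 0

    def tx(self):
        """Return the (ph, osd_cnt) DLLP to send this cycle, or None."""
        if self.state is State.ADVERTISE:
            return (0, self.local)
        if self.state is State.CONFIRM:
            return (0, self.negotiated)
        if self.state is State.INIT2:
            return (1, self.negotiated)
        return None

    def rx(self, dllp, elapsed_ns):
        """Consume the partner's DLLP (or None) and advance the state."""
        if self.state is State.ADVERTISE:
            self.timer_ns += elapsed_ns
            if dllp is not None:
                self.negotiated = min(self.local, dllp[1])  # lesser of the two
                self.state = State.CONFIRM
            elif self.timer_ns >= TIMEOUT_NS:
                self.state = State.NON_SHARED
        elif self.state is State.CONFIRM:
            if dllp is not None and dllp[1] == self.negotiated:
                self.state, self.timer_ns = State.INIT2, 0
        elif self.state is State.INIT2:
            self.timer_ns += elapsed_ns
            if dllp is not None and dllp[0] == 1:
                self.state = State.DONE
            elif self.timer_ns >= TIMEOUT_NS:
                self.__init__(self.local)  # reset and start over


def run(a, b, cycle_ns=4, max_cycles=2000):
    """Exchange DLLPs between two negotiators, one per cycle, in lockstep."""
    for _ in range(max_cycles):
        to_b, to_a = a.tx(), b.tx()
        a.rx(to_a, cycle_ns)
        b.rx(to_b, cycle_ns)
    return a, b


switch, device = run(OsdNegotiator(16), OsdNegotiator(4))
print(switch.state, device.state, switch.negotiated)
```

A 16-OSD switch facing a 4-OSD device settles on 4 OSDs for the link, and a negotiator that hears nothing for 3 us falls back to the non-shared path.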



2.4.6.1.4 Shared resource initialization

Once the number of OSDs is known on a link, shared resource initialization begins if the
VN field was set to 01 during OSD negotiation. If the VN field was 00, this step is
skipped.

This mechanism allows the two devices to map multiple OSD/VCs onto a common
resource, or buffer, if desired. The results of the shared resource initialization will be
written to a register for software to read during system setup update. If any remapping of
OSD/VCs to buffers is required, it can be done at that point.

This protocol performs the same basic steps as OSD negotiation. If a link partner does
not respond with InitRN1 DLLPs within a 3 us time interval, it does not support shared
resource initialization.



[Figure: RN initialization DLLP format. Rows Byte 0 and Byte 4, columns +0 to +3 (bits 7 to 0 each). Fields, in order: Type (0000 0010), PH, R, RN Cnt, LCRC.]

Type - always set to 0000 0010 for an RN initialization DLLP

PH - Phase of RN initialization (0 for InitRN1 DLLPs, 1 for InitRN2 DLLPs)

R - Reserved

RN Cnt - Total number of shared buffers that can be used by the link partner

The step by step breakdown of the shared resource initialization protocol is shown below.

1. The RN state machine will begin transmitting the same packet, an InitRN1 DLLP,
over and over every clock cycle until it receives an InitRN1 DLLP from the link
partner's RN state machine.

2. The RN state machine now sets the PH bit so that the DLLPs are now InitRN2
type. The fact that the RN state machine is now sending this type of DLLP means
that it understood the RN initialization procedure from its link partner. A timer is
also started at this point, and if the timer expires (3 us) before an InitRN2 is
received from the link partner, the state machine is reset and the process begins
again. If the state machine receives an InitRN2 from its link partner, shared
resource initialization is complete and it stops transmitting InitRN2 DLLPs.

If this step is skipped because a link partner does not support it, the routing header will
always have the RNP bit set to 0 since the RN field will not be used.



2.4.6.1.5 System setup update

At this point, the number of OSDs (and optionally the number of buffer resources) available
in the link partner is known by the hardware. Both values are written to a register so that
software can see the results.

Based on the OSD negotiation that determined the allowable OSDs on that port, the
switch will write to the 16 buffer resource registers based on the results of OSD
negotiation. These registers contain some fields that specify what the encodings are
(internal to the switch) for a particular OSD/VC. The EEPROM will have
already set the credit amounts for both the header and data buffers.

For example, if 2 OSDs were negotiated, the 16 registers might look something like
this:

Buffer resource 0 Register: OSD=0, VC=0, valid=1
Buffer resource 1 Register: OSD=1, VC=0, valid=1
Buffer resource 2 Register: OSD=x, VC=x, valid=0

This means that ingressing TLPs on OSD0/VC0 will use buffer resource 0 space since the
link partner will set the RN=0 when it sets the OSD=0 in the AS header. Incoming TLPs
on OSD1/VC0 will have RN=1, OSD=1 in the ExAS header.
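
A short model of the register set in this example may help. It is a sketch under stated assumptions: the dataclass, its field names, and the lookup helper are hypothetical, and every negotiated OSD is placed on VC0 as in the example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BufferResourceReg:
    osd: int = 0
    vc: int = 0
    valid: bool = False


def build_registers(negotiated_osds, vc=0, count=16):
    """One valid register per negotiated OSD (all on one VC); the rest invalid."""
    return [
        BufferResourceReg(osd=i, vc=vc, valid=True) if i < negotiated_osds
        else BufferResourceReg()
        for i in range(count)
    ]


def resource_number(regs, osd, vc) -> Optional[int]:
    """Return the RN (register index) that serves this OSD/VC, or None."""
    for rn, reg in enumerate(regs):
        if reg.valid and reg.osd == osd and reg.vc == vc:
            return rn
    return None


regs = build_registers(negotiated_osds=2)
print(resource_number(regs, osd=0, vc=0))  # buffer resource for OSD0/VC0
print(resource_number(regs, osd=1, vc=0))  # buffer resource for OSD1/VC0
print(resource_number(regs, osd=2, vc=0))  # OSD2 was not negotiated
```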

The hardware will now pause and query a control bit, halt_on_osd_complete, to
determine what to do next. If that control bit is set (by the EEPROM during system
boot), the hardware will wait for software to clear it before proceeding to FC initialization.
If the bit is cleared (i.e., wasn't set by the EEPROM programming), the hardware will
proceed to FC initialization immediately.

The purpose of this bit is to allow the system software/console management software an 
opportunity to take a look at what the results were of OSD negotiation. It can then go re- 
provision the buffer resources prior to FC initialization by reallocating space to different 
OSDs if necessary (and possibly VCs; this particular topic is covered in a sub-section 
below). Note that the hardware makes no assumptions on the "correctness" of this 
reprogramming and will make no attempt to recover from invalid programming at this 
point. 




2.4.6.1.6 Flow Control (FC) initialization 

If only 1 OSD is present as a result of OSD negotiation, or OSD negotiation was skipped
because the link partner is a non-shared device, the state machine will begin the normal
PCI-Express base specification FC initialization procedure. However, if more than one
OSD is present, the state machine will begin "shared I/O base FC initialization". If the
shared resource initialization step was successful, the state machine will begin "shared
I/O extended FC initialization." There is also another mechanism called Buffer Retry
Mode that is not implemented and is explained in section 2.4.6.1.9.
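
The choice between the three FC initialization procedures reduces to a small decision rule, sketched here. The function name and the returned strings are hypothetical labels; passing None for the OSD count is this sketch's way of saying that OSD negotiation was skipped.

```python
def select_fc_init(negotiated_osds, shared_resource_init_ok):
    """Pick the FC initialization procedure for a link.

    negotiated_osds is None when OSD negotiation was skipped because the
    link partner is a non-shared device (an assumption of this sketch).
    """
    if negotiated_osds is None or negotiated_osds == 1:
        return "base specification FC initialization"
    if shared_resource_init_ok:
        return "shared I/O extended FC initialization"
    return "shared I/O base FC initialization"


print(select_fc_init(None, False))
print(select_fc_init(4, False))
print(select_fc_init(4, True))
```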



2.4.6.1.7 Shared I/O Base FC

A new DLLP is created to convey FC initialization information. This DLLP is shown
here:

[DLLP format diagram: columns +0 to +3 (bits 7 to 0 each). Fields, in order: Type (0111 0 v2 v1 v0), PH, TT, R, OSD, CT, Credit count.]

Figure 2.4-11: InitFC-H/InitFC-D DLLP format



Type - Upper nibble set to 0111 for an FC initialization shared-link DLLP. The lower
nibble specifies the VC number.

PH - Phase of FC Initialization (0 for InitFC1 DLLPs, 1 for InitFC2 DLLPs)

TT - Transaction type (00 for Posted, 01 for Non-posted, 10 for Completions)

R - Reserved

OSD - OS Domain, ranging from 0 to 63 to specify the unique OSDs on the link

CT - Credit type (0 for header credits, 1 for data credits); this basically identifies the
DLLP as either an InitFC1_H or an InitFC1_D

Credit count - contains either the 12-bit data credit count or the 8-bit (upper 4 bits are
zeros) header credit count, based on the value of the CT bit

This section shows how shared-port FC initialization (where shared resource initialization
was skipped) is performed. The pattern will closely resemble the base specification
method of advertising normal FC. The state machine begins sending InitFC1_H and
InitFC1_D DLLPs. It will send them in a repeating sequence as shown below as an
example of a link that negotiated 2 OSDs. For this example, our switch was configured
to only enable VC0 on each OSD:

InitFC1_H (OSD = 0, VC = 0, Posted, header)
InitFC1_D (OSD = 0, VC = 0, Posted, data)
InitFC1_H (OSD = 0, VC = 0, Non-posted, header)
InitFC1_D (OSD = 0, VC = 0, Non-posted, data)
InitFC1_H (OSD = 0, VC = 0, Completion, header)
InitFC1_D (OSD = 0, VC = 0, Completion, data)
InitFC1_H (OSD = 1, VC = 0, Posted, header)
InitFC1_D (OSD = 1, VC = 0, Posted, data)
InitFC1_H (OSD = 1, VC = 0, Non-posted, header)
InitFC1_D (OSD = 1, VC = 0, Non-posted, data)
InitFC1_H (OSD = 1, VC = 0, Completion, header)
InitFC1_D (OSD = 1, VC = 0, Completion, data)




So for this example, 12 unique DLLPs carry the credits for each OSD/VC
enabled. Anytime more VCs are enabled using the normal mechanism for PCI-Express,
this procedure is run in the same fashion.

Since VC0 should always be enabled, our switch should "expect" to receive the same 12
DLLPs from the link partner.

This pattern will continue until all corresponding DLLPs have been received from the
link partner. At that time, the switch will begin sending the same sequence of DLLPs,
except they will be InitFC2_H and InitFC2_D DLLPs (again, all 12 in a repeating
sequence). Once the switch receives an InitFC2 DLLP from its link partner, FC
initialization is complete. Note that whenever the FC1 phase is complete, TLPs can
begin transmitting on the link. FC2 is just used to finally complete the handshake.
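
The repeating sequence can be generated mechanically from the number of negotiated OSDs. The sketch below is illustrative only: the function name and tuple layout are hypothetical, and the ordering when more than one VC is enabled is an assumption, since the document only shows the single-VC case.

```python
from itertools import product

TRANSACTION_TYPES = ("Posted", "Non-posted", "Completion")  # TT = 00, 01, 10
CREDIT_TYPES = ("header", "data")                            # CT = 0, 1


def init_fc_sequence(num_osds, vcs=(0,), phase=1):
    """List the InitFC<phase>_H / _D DLLPs for one pass of the sequence."""
    seq = []
    # product() varies the last argument fastest, which matches the listing:
    # header then data, for each transaction type, for each OSD.
    for osd, vc, tt, ct in product(range(num_osds), vcs,
                                   TRANSACTION_TYPES, CREDIT_TYPES):
        name = f"InitFC{phase}_{'H' if ct == 'header' else 'D'}"
        seq.append((name, osd, vc, tt, ct))
    return seq


for dllp in init_fc_sequence(num_osds=2):
    print(dllp)
```

For 2 OSDs on VC0 this yields the 12 DLLPs listed above; calling it with phase=2 gives the InitFC2_H and InitFC2_D sequence used to finish the handshake.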

The EEPROM will contain the pre-calculated amount of credits to advertise and load
these values into our internal registers at boot time. This can result in non-optimal buffer
usage in the event the default provisioning is set up for a device that does not contain the
same number of OSDs and/or VCs. The actual hardware defaults will assume all 16
OSDs are enabled, all with VC0. As such, the credits will be equally split across all 16
OSDs for all transaction types.



2.4.6.1.8 Shared I/O extended FC init

This section shows how shared-port FC initialization (where shared resource initialization
was not skipped) is performed. The pattern will closely resemble the base specification
method of advertising normal FC. The state machine begins sending InitFC1_H and
InitFC1_D DLLPs. It will send them in a repeating sequence as shown below as an
example of a link where we have 2 shared resources (RN = 2) as provisioned by the
EEPROM:

InitFC1_H (RN = 0, Posted, header)
InitFC1_D (RN = 0, Posted, data)
InitFC1_H (RN = 0, Non-posted, header)
InitFC1_D (RN = 0, Non-posted, data)
InitFC1_H (RN = 0, Completion, header)
InitFC1_D (RN = 0, Completion, data)
InitFC1_H (RN = 1, Posted, header)
InitFC1_D (RN = 1, Posted, data)
InitFC1_H (RN = 1, Non-posted, header)
InitFC1_D (RN = 1, Non-posted, data)
InitFC1_H (RN = 1, Completion, header)
InitFC1_D (RN = 1, Completion, data)

2.4.6.1.9 Buffer Retry Mode 

In Buffer Retry Mode, FC Initialization will be performed according to the way it is
specified in the PCI Express Base Specification. If the port is shared, it will be
transparent at this stage - FC Initialization is performed only on a per VC basis. The
partitioning of the buffer resources per OSD within each VC is determined during the
configuration of the device-specific registers.

In Buffer Retry Mode, the Receive Data Buffer is partitioned by VC as well as by OSD, and
sets aside a "surplus" amount of memory that all types of packets can access regardless of
OSD. The partitioning of the memory for Buffer Retry Mode is shown in Figure 12 and
the variables to be programmed are shown in Table 4.

[Figure 12: Breakdown of the Data Buffer. Panel label: Retry Buffer Mode. Segments labeled in the diagram: P2_TOTAL_MEM, P1_OSD(0)_RSVD_MEM, P1_OSD(1)_RSVD_MEM, ..., P1_OSD(m-2)_RSVD_MEM, P1_OSD(m-1)_RSVD_MEM, and P1_VC(1)_SURPLUS_MEM, where n = number of VCs and m = number of OSDs. Notes: * In default mode, there are no further partitions beyond the segments shown above. ** In retry mode, the memory is partitioned as shown above.]




Since Flow Control DLLPs report credits solely by VC, and the Buffer Retry Mode 
breaks the buffer down even further into OSDs per VC, a packet could be transmitted that 
had enough flow control credit, but would still not get stored in the data buffer. This 
would happen if the packet belonged to an OSD that no longer had free space in its 
section of the data buffer, even though there appears to be credit according to flow 
control. 

Instead of dropping packets that could not get stored in the data buffer due to a lack of
resources for a particular OSD, the Ack/Nak DLLPs are used to inform the other side of
the link to retry the packet. According to the PCI Express Base Specification, an
Ack/Nak DLLP is generated in the receive side of the Data Link Layer and is transmitted
to the other side of the link to inform the transmit side of the Data Link Layer to either
free its buffer resources that correspond to the packet (if Ack'd because there were no
data integrity errors), or retransmit the retry buffer (if Nak'd because there were data
integrity errors). If repeated attempts to transmit a TLP are unsuccessful, the transmitter
will instruct the Physical Layer to retrain the link.

The VMAC takes the Ack/Nak DLLPs to another level by allowing the Rx Transaction
Layer Module to alter the DLLP depending on whether the TLP is able to get stored in
the Data Buffer. If the TLP cannot be stored, the Ack DLLP that corresponds to the TLP
is changed to a Nak DLLP and is transmitted to the other side of the link. One of the
reserved bits in the Nak DLLP will be set to differentiate between a Nak caused by a
data integrity error (which could ultimately retrain the link) and a Nak that is caused by a
buffer retry condition. This bit is shown in red in Figure 13 (Note: Using reserved bits
is acceptable since non-zero values in Reserved fields are ignored and will not cause
errors with other PCI Express links).

If the TLP has enough resources in the data buffer, but is stored in the surplus segment,
one of the reserved bits in the Ack DLLP will be set to notify the other side that the OSD
corresponding to that particular TLP is reaching buffer saturation. This bit will be used
by the Transaction Arbiter in the Tx Presentation Module. The Transaction Arbiter will
skip the Transaction Queue that corresponds to the particular VC/OSD group on the
proximate turn. This will give the Rx data buffer some time to free up its resources.
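
The receive-side decision described above can be sketched as follows. This is a simplified model, not the VMAC logic: the class name, the byte-based accounting, and the flag strings are hypothetical, and it only covers TLPs that have already passed their data integrity checks.

```python
class RetryModeReceiver:
    """Toy model of one VC's data buffer in Buffer Retry Mode."""

    def __init__(self, reserved_per_osd, surplus_bytes):
        self.reserved = dict(reserved_per_osd)  # free bytes reserved per OSD
        self.surplus = surplus_bytes            # free bytes shared by all OSDs

    def receive(self, osd, tlp_bytes):
        """Return (dllp, flag) for a TLP with no data integrity errors."""
        if self.reserved.get(osd, 0) >= tlp_bytes:
            self.reserved[osd] -= tlp_bytes
            return ("Ack", None)
        if self.surplus >= tlp_bytes:
            self.surplus -= tlp_bytes
            # Stored in the surplus segment: the Ack carries a reserved bit
            # telling the transmitter that this OSD is nearing saturation.
            return ("Ack", "osd_saturating")
        # No room anywhere: the Ack becomes a Nak whose reserved bit marks it
        # as a buffer retry rather than a data integrity failure.
        return ("Nak", "buffer_retry")


rx = RetryModeReceiver(reserved_per_osd={0: 256, 1: 256}, surplus_bytes=256)
print(rx.receive(0, 256))  # fits in OSD 0's reserved space
print(rx.receive(0, 256))  # spills into the surplus segment
print(rx.receive(0, 256))  # nothing left for OSD 0
print(rx.receive(1, 256))  # OSD 1 still has its own reserved space
```

On the transmit side, the saturation flag would cause the Transaction Arbiter to skip that VC/OSD queue on its next turn, while the buffer-retry Nak triggers a retransmission without counting toward link retraining.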



Programmable Variable   Mode                     Width    Description

P1_TOTAL_MEM            Default & Buffer Retry   2 bits   00 = 32KB total memory allocated to Port 1
                                                          01 = 64KB total memory allocated to Port 1
                                                          10 = 96KB total memory allocated to Port 1
                                                          11 = 128KB total memory allocated to Port 1

P1_VC(n)_MEM            Default & Buffer Retry   6 bits   0x0 = 128KB, 0x1 = 2KB, 0x2 = 4KB, ..., 0x3F = 126KB
                                                          where [constraint illegible]

P1_VC(n)_SURPLUS_MEM    Buffer Retry Only        3 bits   0x0 = 16KB, 0x1 = 2KB, 0x2 = 4KB, ..., 0x7 = 14KB
                                                          where P1_VC(n)_SURPLUS_MEM < P1_VC(n)_MEM

P1_OSD(m)_RSVD_MEM      Buffer Retry Only        6 bits   0x0 = 128KB, 0x1 = 2KB, 0x2 = 4KB, ..., 0x3F = 126KB
                                                          where [constraint illegible]

P2_TOTAL_MEM            Default & Buffer Retry   N/A      P2_TOTAL_MEM = 128KB - P1_TOTAL_MEM

P2_VC(n)_MEM            Default & Buffer Retry   6 bits   0x0 = 128KB, 0x1 = 2KB, 0x2 = 4KB, ..., 0x3F = 126KB

P2_VC(n)_SURPLUS_MEM    Buffer Retry Only        3 bits   0x0 = 16KB, 0x1 = 2KB, 0x2 = 4KB, ..., 0x7 = 14KB
                                                          where P2_VC(n)_SURPLUS_MEM < P2_VC(n)_MEM

P2_OSD(m)_RSVD_MEM      Buffer Retry Only        6 bits   0x0 = 128KB, 0x1 = 2KB, 0x2 = 4KB, ..., 0x3F = 126KB
                                                          where [constraint illegible]

Table 4: Programmable Variables for Data Buffer
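
The encodings in Table 4 follow a simple pattern: each non-zero code is a multiple of 2KB, and code 0 wraps around to the largest size. The helper functions below are a hypothetical sketch of how software might decode them; the function names are not defined anywhere in this document.

```python
def total_mem_kb(code):
    """Decode the 2-bit P1_TOTAL_MEM field: 00 = 32KB ... 11 = 128KB."""
    if not 0 <= code <= 0b11:
        raise ValueError("P1_TOTAL_MEM is a 2-bit field")
    return 32 * (code + 1)


def vc_or_osd_mem_kb(code):
    """Decode a 6-bit Px_VC(n)_MEM or Px_OSD(m)_RSVD_MEM field."""
    if not 0 <= code <= 0x3F:
        raise ValueError("field is 6 bits wide")
    return 128 if code == 0 else 2 * code


def surplus_mem_kb(code):
    """Decode a 3-bit Px_VC(n)_SURPLUS_MEM field."""
    if not 0 <= code <= 0x7:
        raise ValueError("field is 3 bits wide")
    return 16 if code == 0 else 2 * code


p1_total = total_mem_kb(0b10)
p2_total = 128 - p1_total  # P2_TOTAL_MEM = 128KB - P1_TOTAL_MEM
print(p1_total, p2_total)
print(vc_or_osd_mem_kb(0x3F), surplus_mem_kb(0x7))
```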



[DLLP format diagram: rows Byte 0 and Byte 4, columns +0 to +3 (bits 7 to 0 each). Fields: Type (0000 0000 = Ack, 0001 0000 = Nak), Reserved, AckNak_Seq_Num, 16-bit CRC.]

Figure 13: DLLP Format for Ack/Nak Packets



2.4.6.1.10 Pseudo-device discovery

Once the FC init is done, the devices are ready to exchange TLPs, so device discovery
can proceed. This is another optional checkpoint at which we could implement another
halt bit, halt_after_fc_init. If the I2C set this bit at boot time, the hardware will now
pause again. Console management software can now come in using the I2C bus and use
the COP to emulate device discovery by acting like a Root Complex. It could send Type0
and Type1 config cycles to the newly-plugged-in I/O device and figure out what VCs,
buffers, and OSDs that device had available. It would load this information into local
memory and essentially restart the whole process again for that I/O device.

This is an advanced feature that allows software the hooks it would need to set up optimal
resources across VCs and buffers before real device discovery happens from the OS. So
the steps would be as follows:



1. I2C initialization 

2. Link training 

3. OSD negotiation 

4. FC initialization 

5. "Pseudo" Device Discovery 

6. Load results into local memory of console processor 

7. Link training 

8. OSD negotiation 

9. System setup update (use newly calculated optimal buffer resources)

10. FC initialization

11. OS Device Discovery



2.4.6.1.11 Discovery 

The discovery mechanism for shared devices uses the standard PCI-Express mechanism
for discovery per OS domain. This mechanism sends Type0 and Type1 configuration
cycles to determine which devices, if any, are present behind a link. Each one of these
cycles is expected within the switch for each of the 16 OS domains.

A root complex will begin sending Type0 CFG cycles to its southbound PCI-Express
port. It will discover the switch port directly connected to the switch. It will discover
the switch as a PCI-bridge, and no other devices on that link. It will then initiate Type1
CFG cycles to discover the devices behind that PCI Bridge on that port. Once inside the
switch, the Type1 CFG cycles will discover all ports and PCI Bridges that have been
assigned to that OS domain. A shared port will appear as assigned to that OS domain.

When a root complex discovers a shared port, it has no knowledge that the port is shared.
As such, the port will appear to that root complex as a PCI-Bridge, the same as all other
ports. In the initial switch implementation, there will be 16 OS domains possible. Each
root complex will be mapped to one of those OS domains. This mapping will have a
specific AS turn pool encoding as a 16 port switch. From the switch view, it will always
have a 16 port switch as its link partner.



For CFG cycles sent to a shared link, the CFG cycle will be encapsulated within the AS
PI-8. This will be sent with the turn pool encoding assigned to the appropriate OS
domain. The response to that CFG cycle will depend on the number of real OS domains 
supported by that link partner. For a given shared link partner, it will support a certain 
number of OS domains. The shared link partner will only respond to CFG cycles which 
are mapped to OS domains that it supports. 

In the following diagram, there is a 4 OS domain shared controller tied to the switch.
The switch supports 16 OS domains, enumerated as a 16 port virtual AS switch. The
shared controller supports 4 OS domains, enumerated as a 4 port virtual AS switch. The
switch always sends packets encapsulated to a 16 port virtual AS switch. The controller
always sends packets encapsulated to a 4 port virtual AS switch. If a port ever sees
packets encapsulated with a turn pool beyond the range it supports, the packets are not
responded to at the transaction layer.

NextIO, Inc. Confidential
Property of NextIO, Inc.
Page 43 of 222
NextIO, Inc.
© All Rights Reserved.
NEXSIS Overview Document
V0.8



This method allows standard PCI discovery to work. In the example below, each OS
domain maps to a given virtual AS switch port. Root Complex 1 initiates a Type1 CFG
cycle destined for the shared controller. The switch changes the Type1 to a Type0 cycle
at Port 10 prior to sending the CFG cycle to the controller. This CFG cycle is
encapsulated to a single virtual AS port, and the controller responds by sending the
response on the corresponding return virtual AS port.

If a Root Complex were to exist that was mapped to a virtual AS port that the controller
does not have, e.g. virtual AS port 10, the controller would drop the packet. At the
transport level, this would be the equivalent of a timeout on the CFG read, and the Root
Complex would assume no I/O device is present on that PCI bus. This mechanism
allows all devices to be discovered in the same way, with no I/O device required to
know the full fabric topology.



[Diagram: Root Complexes 1, 2, 3, and 4 each share a 10G NIC through the switch (switch ports labeled in the diagram include Port 5 and Port 7), which connects to a 4 OS Domain Sharable 10G NIC.]

Figure 2.4-14: Example of a Shared Switch

2.5 Error Handling

Two aspects of a PCI Express switch make its error handling behavior much different 
than that of a PCI Bridge: 

• First, transient link errors can typically be corrected automatically by hardware in
the Data Link Layer. This eliminates the need to report these errors by setting any
bits in the PCI status registers and eliminates any problems produced when the
failed packet was being forwarded on behalf of another device. This is because
the packet will ultimately be transmitted correctly, so no error handling
procedures beyond the scope of a single link need to be considered. Repeated
failures will ultimately cause a fatal error to be recorded and the faulty link will
be shut down.

• Second, non-posted transactions are managed with a completion message that
eliminates the need for a forwarding agent to keep track of expected response
messages. So a switch can blindly forward transactions without determining the
message type or worrying about mapping errors back to the initiating master.

When a Data Link Layer or Physical Layer error occurs which cannot be associated with a
particular packet or OSD, the error is logged for each OSD and error messages,
if any, are sent to each port sharing the port with an error.

The Nexis Switch implements the PCI Express Advanced Error Reporting Capability
registers in addition to the PCI Express baseline error reporting registers and the legacy PCI
mapped error registers. The Advanced Error Reporting Capability registers provide
detailed error logging, error masking and error severity control registers. They also provide
a header logging register for the first unmasked error that is logged. Refer to the
Configuration Registers section for a detailed description of the Advanced Error Reporting
Capability registers.



2.5.1 Error Types

PCI Express errors are classified as Correctable and Uncorrectable errors. Uncorrectable
errors are further classified as Fatal or Non-Fatal errors. Errors are also classified based
on the source of the error as Transaction Layer Errors, Data Link Layer Errors and
Physical Layer Errors. The following sections describe each of the errors and how they
are handled in the Nexis Switch.

2.5.1.1 Correctable Errors

Correctable errors are those which are localized to a single PCI Express link and can be
automatically corrected by hardware. All correctable errors are automatically corrected
by a retransmission of the faulty packet. An ERR_COR message reporting the occurrence
of the error may optionally be sent to the root complex. The message is sent only if the
error is not masked and the SERR Enable bit is set in the Command Register and Bridge
Control register.



2.5.1.1.1 Physical Layer Errors
• Receiver Error

Physical Layer receivers may optionally check for errors and report them by sending
an ERR_COR message to the root complex. Any DLLP or TLP being received that is
in error should be discarded and any storage allocated made available. The error will
be automatically corrected when a NAK DLLP is received and the packet is re-
transmitted.

The Physical Layer will check for disparity errors and invalid symbols. If any of
these errors occur, the Physical Layer will report it. If the error (either one) is part of
a valid TLP, the Rx Link Layer will send a NAK DLLP for the corresponding TLP
and will not report an error since it was already done by the Physical Layer. If the
error is detected in the Data Link Layer in time (within 3 clock cycles from the start
of packet), the packet will be purged and the Rx Transaction Layer will never see the
packet. If the error is not detected in time, the TLP will be forwarded to the
Transaction Layer with an error indication at the end of the packet.



2.5.1.1.2 Data Link Layer Errors 
• Bad TLP 

This error is set when the link layer detects a packet with any of the following: 

o Bad CRC 

o Incorrectly nullified packet (TLP ends with an EDB but the LCRC is not 
inverted) 

o Incorrect packet sequence number 

• Bad DLLP 

This error occurs when a CRC check fails on a received DLLP. 

• Replay Timer Timeout 

This error occurs when the REPLAY_TIMER has been exceeded by a given TLP, which 
occurs when no ACK or NAK DLLP is received within that time period. This error is 
automatically corrected by NAKing the TLP and forcing a re-transmission. 

• REPLAY_NUM Rollover 

This error occurs when a given TLP was unsuccessfully retransmitted REPLAY_NUM 
times. This condition is automatically corrected by signaling the Physical Layer to 
retrain the link. Once retraining is successful, the TLP can again be retransmitted (and 
REPLAY_NUM is reset). 



2.5.1.2 Uncorrectable Errors 

Uncorrectable Errors are those which disrupt the functionality of the PCI Express port but 
cannot be corrected by hardware. Using the Advanced Error Reporting Capability 
Registers each uncorrectable error can be configured to be sent with an ERR_FATAL or 
ERR_NONFATAL message to the root complex or can be masked off from sending a 
message. The error messages are sent only if the SERR Enable bit is set in both the 
Command Register and Bridge Control Register. 



2.5.1.2.1 Physical Layer Errors 
• Training Error 

This error occurs when a device fails to establish a link with its partner. An error 
message is sent to the root complex if enabled, and the link is taken down. 

2.5.1.2.2 Data Link Layer Errors 

• Data Link Layer Protocol Error 

This error is caused when the TX MAC receives an ACK/NAK DLLP with a 
sequence number that does not correspond to any of the packets in the retry buffer. 

2.5.1.2.3 Transaction Layer Errors 

• Unsupported Request 

An Unsupported Request Error is generated for the following conditions: 

o A request cannot be mapped to any address space mapped within the device or 
to any egress ports. 

o Downstream port of the switch receives a configuration request with Device 
number 1-31. The port will terminate the transaction and not issue it on the 
link. 

o A packet is forwarded to an egress port to be transmitted across the link, but 
the link is in DL Down state. 



• ECRC Error 

Logging this error is optional. The Nexsis switch will not verify the ECRC for forwarded 
packets. All transactions originating from the switch (COP) will not contain ECRC. All 
transactions destined to the switch (Type 1/Type 0 headers and Device specific registers) 
that have ECRC will not be checked. The ECRC Generation Capable and ECRC Check 
Capable bit in the Advanced Error Capability register is hardwired to zero. 

• Malformed TLP 

The receiver of TLP packets generates this event when an inconsistency in the 
formation of a packet is detected at the receiver (destination). There are several 
conditions that require detection and reporting, some others are optional. The table 
below shows all conditions that are supported and not supported by the switch. 



Malformed Packet Errors | Supported? 
------------------------------------------------------------------------------ 
Data payload exceeds max payload size | Yes 
Actual data length does not match data length specified in the header | Yes 
Start memory DW address crossing a 4KB boundary | No 
TD field = 1 but no ECRC | No 
Byte enable violation detected | Yes, only for writes to COP 
Packets with undefined type field | Yes 
Multiple completions with read data that violate RCB | No 
Completions with a configuration request retry status in response to a request other than a configuration request | No (is this optional?) 
TC contains a value not assigned to an enabled VC within the TC/VC mapping for the receiving device | 
Transaction type requiring use of TC0 has TC value other than zero | ?? (maybe V2) 
Routing is incorrect for transaction type (e.g. transactions requiring routing to RC detected moving away from RC) | ?? (maybe this is an ALM check) 
Msg/MsgD messages with 000b routing received at upstream port | ?? (ALM check) 
Msg/MsgD messages with 011b routing received at downstream port | ?? (ALM check) 



A malformed TLP will be discarded and an ERR_NONFATAL or ERR_FATAL 
message may be sent to the root complex. No Nak DLLP is sent in response to a 
Malformed TLP, and the flow control credits are not updated. A Completion Response 
will not be issued for non-posted transactions with a malformed TLP. 



• Receiver Overflow 

The receiver may optionally check for Receiver Overflow errors (TLPs exceeding 
CREDITS_ALLOCATED). If this condition is detected, TLP(s) are discarded 
without modifying the CREDITS_RECEIVED and any resources that had been 
allocated for the TLP(s) are de-allocated. (Not supported right now in the FPGA) 

• Completion Timeout 

When a non-posted transaction fails to return a completion message within the 
prescribed time limit, then a completion timeout error has occurred. The Nexsis 
switch will not master any non-posted transactions, and so will never generate a 
Completion Timeout error for any packets going off-chip. 

• Completer Abort 

This error occurs when the Completer of a request is unable to process the request 
due to a component-specific error condition. There are no known conditions for 
which the switch will generate this error. 

• Unexpected Completion 

This error occurs when a completion message is received that cannot be matched to 
any outstanding requests. The switch will report this error only for the completion 
targeted to the switch as an endpoint. The ALM detects this condition and notifies the 
MAC to log the error. 



• Flow Control (FC) Protocol Error 

The receiver of Flow Control DLLP packets generates this event when a violation of 
the flow control protocol is detected at the receiver (destination). There are several 
conditions that require detection and reporting, some others are optional. The 
following conditions may be checked and violations reported. 

o During FC initialization for any Virtual Channel the VC must advertise credits 
equal to or greater than the minimum for that FC type. 

o IF an Infinite Credit advertisement (value ...) has been made during 
initialization, THEN any future update credit must be set to zero. 

• Poisoned TLP 

A poisoned TLP is one where the EP bit in the header is set indicating that the TLP 
is known to contain an error. If the packet is already poisoned the switch will not 
issue an error message unless it is the final target of the poisoned TLP. 

If the error is an uncorrectable ECC error in the internal data buffers of the switch for 
one of the transit packets, the switch sets the EP bit in the header, logs the error and 
forwards the packet. 

When a switch forwards a poisoned TLP, the receiving side must set its Detected 
Parity Error bit and the transmitting side must set its Master Data Parity Error bit if 
the Parity Error Response bit in the Bridge Control register is set. 



2.5.1.3 Cut-Through Error Handling 

If the cut-through enable bit is asserted in the Nexsis Switch, a TLP could be forwarded 
from the ingress port to the egress port before it has been completely received on the 
ingress port. This complicates the error handling in the conditions where the TLP would 
otherwise be discarded. If an error does occur on a cut-through packet after it has begun 
transmission out an egress port, then the TLP must be 'nullified' to indicate to the 
receiving device that an error has occurred. A TLP is nullified by using the 
inverted value for its LCRC and by signaling the physical layer that it must use an EDB 
symbol instead of an END symbol as the final framing symbol. The ingress port returns 
a NAK DLLP to the TLP source and the egress port purges the packet from the Replay 
buffer. When the endpoint finally receives the TLP and detects the EDB symbol and the 
inverted CRC, it purges the packet and does not return a NAK DLLP. 
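
The receive-side decision described above can be sketched in a few lines of Python. This is only an illustration of the rules stated in this section and in the Bad TLP description, not the switch's actual logic: the function name, the enum, and the idea of passing in a separately computed LCRC are all assumptions made for the example, and the case of an inverted LCRC without an EDB is simply folded into the bad-CRC case here.

```python
from enum import Enum

LCRC_MASK = 0xFFFFFFFF  # the LCRC is a 32-bit value


class RxAction(Enum):
    ACCEPT = "accept TLP"
    DISCARD_NO_NAK = "discard, no Nak"          # correctly nullified TLP
    DISCARD_AND_NAK = "discard, schedule Nak"   # reported as Bad TLP


def classify_tlp_end(ends_with_edb: bool, received_lcrc: int, computed_lcrc: int) -> RxAction:
    """Decide what a receiver does once the final framing symbol is seen.

    Hypothetical helper: 'computed_lcrc' is the LCRC the receiver calculated
    over the packet, 'received_lcrc' is the value carried in the packet.
    """
    inverted = received_lcrc == (~computed_lcrc & LCRC_MASK)
    if ends_with_edb and inverted:
        # Properly nullified cut-through packet: purge silently.
        return RxAction.DISCARD_NO_NAK
    if ends_with_edb or received_lcrc != computed_lcrc:
        # EDB without inverted LCRC, or a plain CRC mismatch.
        return RxAction.DISCARD_AND_NAK
    return RxAction.ACCEPT
```

A correctly nullified packet is the only case that is dropped without a Nak, which is why the egress port can safely purge it from its Replay buffer.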

2.5.2 PCI 2.3 Error Reporting 

All PCI Express ports will update the error status bits in the PCI 2.3 configuration space 
as appropriate in order to maintain compatibility with legacy drivers. Note that these 
status bits are independent of any status maintained in the Advanced Error Reporting 
Status or control registers. In particular, setting or clearing an Advanced Error Reporting 
Status bit should not clear the corresponding bit in the PCI 2.3 configuration registers - 
these must be left to be explicitly managed by software. 

Each PCI Express port connects implicitly to a PCI-PCI virtual bridge. Each of these 
bridges will implement a complete and independent PCI-PCI bridge configuration 
space. 

Note that the primary bus of each bridge is the one closest to the root complex, and that 
this can vary depending on how the Nexsis Switch is installed in the system. 

The following sections detail how the Nexsis Switch should report errors to remain 
compatible with legacy PCI 2.3 software. 


2.5.2.1 Primary side of P2P Bridge 

• Detected Parity Error 

This error will be set whenever the primary side of the internal P2P bridge receives a 
poisoned TLP. In the Nexsis switch the Rx ingress MAC (Root Port) would set the 
Detected Parity Error bit in the Primary Status Register when receiving a poisoned 
TLP. 

• Signaled System Error 

The TX MAC of the upstream port must set the Signaled System Error if an 
uncorrectable error message (ERR_FATAL or ERR_NONFATAL) is transmitted. 

• Received Master Abort 

This error needs to be detected only by the PCI Express device which originally initiates 
a transaction and hence is not applicable to the Nexsis switch. 

• Received Target Abort 

This error needs to be detected only by the PCI Express device which originally initiates 
a transaction and hence is not applicable to the Nexsis switch. 

• Signaled Target Abort 

Hardwired to 0 since we will not abort any transactions as an endpoint. 

• Master Data Parity Error 

This error is detected when forwarding a Poisoned TLP from the secondary side of 
the bridge to the primary side. In the Nexsis switch the TX MAC of the upstream port 
must set the Master Data Parity Error bit in the Primary Status register if the Parity 
Error Response bit in the Command Register is set. 

2.5.2.2 Secondary side of P2P Bridge 

• Detected Parity Error 

This bit is set when the secondary side of the P2P bridge receives a poisoned TLP. In the 
Nexsis switch the Rx MAC of the downstream port must set the Detected Parity Error 
bit in the Secondary Status register when it receives a poisoned TLP. 

• Received System Error 

The Rx MAC of a downstream port must set the Received System Error if an 
uncorrectable error message (ERR_FATAL or ERR_NONFATAL) is received. 

• Received Master Abort 

This error needs to be detected only by the PCI Express device which originally initiates 
a transaction and hence is not applicable to the Nexsis switch. 

• Received Target Abort 

This error needs to be detected only by the PCI Express device which originally initiates 
a transaction and hence is not applicable to the Nexsis switch. 

• Signaled Target Abort 

There are no known conditions for which the switch should Target Abort a 
transaction targeted to it. 

• Master Data Parity Error 

This error is detected when forwarding a poisoned TLP from Primary to Secondary. 
In the Nexsis switch the TX downstream MAC would set the Master Data Parity Error 
bit in the Secondary Status register when transmitting a poisoned TLP. 




2.5.3 Error Reporting 

Errors may be reported by either generating an explicit error message, or through the 
Completion Status field in a Completion header. A completion response is used for 
reporting errors resulting from a non-posted request, while explicit error messages are 
used for all other types of messages. Note that a Completion Status may only be used by 
the intended target of the originating message. Thus, a non-posted message that is being 
routed through the switch will never have a Completion generated by the switch, and so 
all errors reported for that message will use explicit messages. The only messages that 
will use the Completion response to report errors will be those messages that are targeted 
to internal registers of the switch itself. 

Note also that only Unsupported Request and Completer Abort errors are reported in a 
completion response. All other errors, even for non-posted requests, will generate an 
explicit error message. 

2.5.3.1.1 Completion Status Response 

The format of a completion header used to respond to an error condition is as follows: 



Byte offsets +0 to +3 (bits 7..0 within each byte), fields listed left to right: 

DW 0:  R | Fmt (0 1) | Type (10000) | R | TC | Reserved | TD | EP | Attr (0 0) | R | Length 
DW 1:  Completer ID | Compl. Status | BCM | Byte Count 
DW 2:  Requester ID | Tag | R | Lower Address 


Figure 2.5-1: Format of Completion Header 

The following table shows the fields and their values for completion headers reporting 
error conditions: 



Field | bits | Description | Value | Variable 
------------------------------------------------------------------------------ 
Length | 9:0 | Always zero for error completions | 0 | 
R | 11:10 | reserved | | 
Attr | 13:12 | Copied from request header | | 
EP | 14 | Indicates TLP is poisoned | | No 
TD | 15 | Indicates presence of TLP Digest | | No 
Reserved | 19:16 | reserved | | No 
TC | 22:20 | Copied from request header | | Yes 
R | 23 | reserved | 0 | No 
Type | 28:24 | Indicates 'Msg' type | 10000 | No 
Fmt | 30:29 | Indicates 4 DW, no data | 0b01 | No 
Byte Count | 43:32 | The remaining byte count for ... | | Yes 
BCM | 44 | | 0 | No 
Compl. Status | 47:45 | 000 = Successful Completion; 001 = Unsupported Request; 010 = Configuration Request Retry Status; 100 = Completer Abort | | Yes 
Completer ID | 63:48 | Bus #, device # and function # of unit generating completion (and reporting error) | | Yes 
Lower Address | 70:64 | Unused | 0 | No 
R | 71 | reserved | 0 | No 
Tag | 79:72 | Copied from request header | | Yes 
Requester ID | 95:80 | Copied from request header | | Yes 

Table 2.5-1 Completion Header Fields 

2.5.3.1.2 Explicit Error Messages 

Explicit error messages are generated for every correctable or uncorrectable error which 
is not masked in the Advanced Error Capabilities register. For shared ports an error 
message for an error from a port which cannot be associated specifically to an OSD (for 
example, Training Error or Replay Timer Timeout) should be sent to all the RCs sharing 
the downstream switch port. 



Error messages generated are one of the following three types: 



Type | Description 
------------------------------------------------------------------------------ 
ERR_COR | Issued when the component or device detects a correctable error on the PCI Express interface 
ERR_NONFATAL | Issued when the component or device detects a Non-fatal, uncorrectable error on the PCI Express interface 
ERR_FATAL | Issued when the component or device detects a Fatal, uncorrectable error on the PCI Express interface 

The format of all error messages is shown in the following figure: 



Byte offsets +0 to +3 (bits 7..0 within each byte), fields listed left to right: 

DW 0:  R | Fmt (0 1) | Type | R | TC | Reserved | TD | EP | Attr (0 0) | R | Length (0) 
DW 1:  Requester ID | Tag (0) | Message Code 


Figure 2.5-2: Format of Error Messages 



All error messages are 16-bytes in length, with the following fields and values: 



Field | bits | Description | Value | Variable 
------------------------------------------------------------------------------ 
Length | 9:0 | Unused - always zero | 0 | No 
R | 11:10 | reserved | 0 | No 
Attr | 13:12 | Attributes - always zero | 0b00 | No 
EP | 14 | Indicates TLP is poisoned | 0 | No 
TD | 15 | Indicates presence of TLP digest | 0 | No 
Reserved | 19:16 | reserved | 0 | No 
TC | 22:20 | Traffic class, must be zero | | No 
R | 23 | reserved | 0 | No 
Type | 28:24 | Indicates 'Msg' type | 10000 | No 
Fmt | 30:29 | Indicates 4 DW header, no data | 0b01 | No 
R | 31 | reserved | 0 | No 
Message Code | 39:32 | 0011 0000 = ERR_COR; 0011 0001 = ERR_NONFATAL; 0011 0011 = ERR_FATAL | | Yes 
Tag | 47:40 | Unused (no completion required) | 0 | 
Requester ID | 63:48 | Bus #, device # and function # of unit reporting error | | 

Table 2.5-3 Error Messages 



1 The spec requires all error messages to use a 4 DW header. This Error Message field 
is taken from the Error Log Register where the header of the first packet in error is stored. 
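
As a rough illustration of the field layout in Table 2.5-3, the sketch below packs the first 64 bits of an error message into a Python integer, using the bit positions exactly as listed in the table. The function names and the choice to return a plain integer are assumptions made for this example; the order in which these bits go onto the link, and the contents of the remaining 8 bytes of the 16-byte message, are not modelled.

```python
# Message Code values as listed in Table 2.5-3.
MESSAGE_CODES = {
    "ERR_COR": 0b0011_0000,
    "ERR_NONFATAL": 0b0011_0001,
    "ERR_FATAL": 0b0011_0011,
}

FMT_4DW_NO_DATA = 0b01   # Fmt, bits 30:29
TYPE_MSG = 0b10000       # Type, bits 28:24 (three lsb's 000 = route to Root Complex)


def requester_id(bus: int, device: int, function: int) -> int:
    """Combine bus (8 bits), device (5 bits) and function (3 bits) numbers."""
    if not (0 <= bus <= 0xFF and 0 <= device <= 0x1F and 0 <= function <= 0x7):
        raise ValueError("bus/device/function out of range")
    return (bus << 8) | (device << 3) | function


def pack_error_message(kind: str, req_id: int) -> int:
    """Return bits 63:0 of an error message as one integer (hypothetical helper)."""
    if not 0 <= req_id <= 0xFFFF:
        raise ValueError("Requester ID must fit in 16 bits")
    word = 0
    word |= FMT_4DW_NO_DATA << 29        # Fmt
    word |= TYPE_MSG << 24               # Type
    word |= MESSAGE_CODES[kind] << 32    # Message Code, bits 39:32
    word |= req_id << 48                 # Requester ID, bits 63:48
    # Length, Attr, EP, TD, TC and Tag are all zero for error messages.
    return word
```

For example, `pack_error_message("ERR_FATAL", requester_id(1, 0, 0))` sets only the Fmt, Type, Message Code and Requester ID fields and leaves every other field at zero, matching the "Value" column of the table.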



2.5.3.1.3 Message Routing 

The three lsb's of the message type field indicate the message routing mechanism, which is 
always '000' for error messages, indicating the message should be routed to the Root Complex. 

2.5.4 Header Logging 

As part of the Advanced Error Reporting capabilities the Nexsis switch logs the TLP 
header for the first uncorrectable transaction layer error reported. Headers are logged 
only if the mask bit for the corresponding error is not set in the Uncorrectable Error Mask 
register and the Uncorrectable Error Status bit pointed to by the First Error pointer is not 
set. Headers are logged in a 4 DWORD register for the following errors. 

• Poisoned TLP received 
• ECRC Check Failed (error is not supported) 
• Unsupported Request 
• Completer Abort (error is not supported) 
• Unexpected Completion 
• Malformed TLP 

There are no variations in header logging logic for shared or non shared port. 
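
The logging condition in this section reduces to two bit tests. The sketch below is a minimal model under stated assumptions: the registers are passed in as plain integers, `error_bit` is the position of the error in the Uncorrectable Error Status/Mask registers, and the function name is hypothetical. The bit number used in the usage note is purely illustrative.

```python
def should_log_header(error_bit: int, uncorr_mask: int,
                      uncorr_status: int, first_error_pointer: int) -> bool:
    """Return True if the TLP header for this error should be captured.

    The header is logged only when (a) the error is not masked and
    (b) the status bit that the First Error pointer refers to is not
    already set, i.e. no earlier logged error is still pending.
    """
    masked = bool((uncorr_mask >> error_bit) & 1)
    earlier_error_pending = bool((uncorr_status >> first_error_pointer) & 1)
    return not masked and not earlier_error_pending
```

With an illustrative error at bit 18, `should_log_header(18, 0, 0, 0)` returns True, while setting bit 18 in the mask, or leaving the status bit named by the First Error pointer set, suppresses the log.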

2.5.5 Error Tables 

The following table lists the errors that may occur and be detected on a single PCI 
Express link. 



Layer | Error Name | Default Severity | Detecting Agent Action 
------------------------------------------------------------------------------ 
P | Receiver Error | Correctable | Send ERR_COR to Root Complex (RC) 
P | Training Error | Uncorrectable (Fatal) | Send ERR_FATAL to RC 
D | Bad TLP | Correctable | Send ERR_COR to RC 
D | Bad DLLP | Correctable | Send ERR_COR to RC 
D | Replay Timeout | Correctable | Send ERR_COR to RC 
D | REPLAY_NUM Rollover | Correctable | Send ERR_COR to RC 
D | Data Link Layer Protocol Error | Uncorrectable (Fatal) | Send ERR_FATAL to RC 
T | Poisoned TLP Received | Uncorrectable (Non-Fatal) | Send ERR_NONFATAL to RC; Log header of TLP 
T | ECRC Check Failed | Uncorrectable (Non-Fatal) | Send ERR_NONFATAL to RC; Log header of TLP 
T | Unsupported Request (UR) | Uncorrectable (Non-Fatal) | Send ERR_NONFATAL to RC; Log header of TLP 
T | Completion Timeout | Uncorrectable (Non-Fatal) | Send ERR_NONFATAL to RC 
T | Completer Abort | Uncorrectable (Non-Fatal) | Send ERR_NONFATAL to RC 
T | Unexpected Completion | Uncorrectable (Non-Fatal) | Send ERR_NONFATAL to RC; Log header of Completion 
T | Receiver Overflow | Uncorrectable (Fatal) | Send ERR_FATAL to RC 
T | Flow Control Protocol Error | Uncorrectable (Fatal) | Send ERR_FATAL to RC 
T | Malformed TLP | Uncorrectable (Fatal) | Send ERR_FATAL to RC; Log header of TLP 

Table 2.5-4: PCI Express Link Errors 

Occurrence of any of the above correctable errors can be flagged in the Correctable Error 
Status Register and masked in the Correctable Error Mask Register. 

Occurrence of any of the above uncorrectable errors can be flagged in the Uncorrectable Error 
Status Register and masked in the Uncorrectable Error Mask Register. Additionally, the 
uncorrectable errors can be programmed to be reported as either a fatal or non-fatal error by use 
of the Uncorrectable Error Severity Register. These registers are replicated for each OSD. 
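
The interaction between the mask and severity registers can be sketched as follows. This is a simplified model, assuming registers are plain integers indexed by a per-error bit position and that a set severity bit selects the fatal message; the function name and signature are hypothetical, and the SERR Enable gating described earlier in section 2.5.1 is folded into a single flag.

```python
from typing import Optional


def select_error_message(is_correctable: bool, error_bit: int, mask: int,
                         severity: int = 0, serr_enable: bool = True) -> Optional[str]:
    """Pick the message to send to the root complex, or None if suppressed.

    'mask' is the Correctable or Uncorrectable Error Mask register as
    appropriate; 'severity' is only consulted for uncorrectable errors.
    """
    if not serr_enable:
        return None                      # messages are gated by SERR Enable
    if (mask >> error_bit) & 1:
        return None                      # masked errors generate no message
    if is_correctable:
        return "ERR_COR"
    # Severity bit set = report as fatal, clear = report as non-fatal.
    return "ERR_FATAL" if (severity >> error_bit) & 1 else "ERR_NONFATAL"
```

Because the registers are replicated for each OSD, a real implementation would run this selection once per OSD with that OSD's register copies.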


2.5.5.1.1 Error Signaling and Logging 
Legend: 

Type: C=correctable, NF=non-fatal, F=fatal - indicates type of error message that is generated 

S=Supported N=Not Supported 

Italicized errors correspond to specific error bit set 

2.5.5.1.2 Rx Physical Layer Errors 



Error | S/N | Packet Behavior | Reported Error | Type 
------------------------------------------------------------------------------ 
Invalid symbol or disparity error | S | Discard pkt if error is part of pkt. Schedule Nak if TLP | Rx Error. Rx Port Error reported by PLM (does not report if it was part of a pkt or not) | C 
Link Error | S | | Receiver Error | C 
Training Error | S | N/A | Training Error | 

Table 2.5-5: Rx Physical Layer Errors 



2.5.5.1.3 Rx Data Link Layer Errors 
Table 2.5-6 Rx Data Link Layer Errors 




Error | S/N | Packet Behavior | Reported Error | Type 
------------------------------------------------------------------------------ 
Invalid sequence number on ACK or NAK DLLP | S | Discard DLLP | DLL Protocol Error | F 
Duplicate Seq. Number on TLP | S | Discard TLP, schedule Ack | No error | 
Unexpected Seq. Number on TLP | S | Discard TLP | Bad TLP | C 
LCRC error on TLP | S | Discard TLP unless it's cut-through. Schedule Nak DLLP. | Bad TLP | C 
LCRC error on DLLP | | Discard DLLP | Bad DLLP | C 
DLLP w/ unsupported type encodings | | Discard DLLP | No associated Error | 
FC Init violations | | N/A | DLL Protocol Error | F 
Rx Framing Violations | S | Discard pkt, send Nak | Bad TLP | C 
TLP with EDB and inverted LCRC | | Discard pkt. No Nak scheduled | No associated Error | 
Nullified TLP without EDB | S/N? | Discard TLP | Receiver Error | C 


Table 2.5-7 

2.5.5.1.4 Rx Transaction Layer Errors 



Error | Opt | Packet Behavior | Reported Error | Type 
------------------------------------------------------------------------------ 
Unmapped address | S | Discard TLP | Unsupported Request | F 
FC overflow | N | Discard pkt. No Nak, No FC update. | Receiver Overflow | F 
Malformed TLP (1) | S | Discard TLP, No Nak, No FC update | Malformed TLP | F 
Pkt length field < actual pkt length | S | Truncate pkt (or discard if S&F). | Malformed TLP | F 
Pkt length field > actual pkt length | S | Stop at END. Discard if S&F. No Nak, No FC update | Malformed TLP | F 
Unexpected Completion (2) | S | Discard completion | Unexpected Completion | 
Request violates programming model of Rx device | N | Discard request | Completer Abort | NF 
Rx device unable to process request due to device-specific error condition | N | Discard request | Completer Abort | NF 
Advertising more than 2048 credits for payload and 128 for header | | | Flow Control Protocol Error | F 
Did not advertise FC credit values >= min defined in Table 2-27 of PCI Express spec | | | Flow Control Protocol Error | F 
Non-zero credit values received after infinite credit advertised | | | Flow Control Protocol Error | F 
 | | Return completion as Unsupported Request, discard request | Unsupported Request | F 
Received poisoned TLP | S | Pass TLP thru, unless directed to switch, then discard (and return UR for non-posted requests) | Poisoned TLP Received | NF 

(1) See conditions for malformed TLPs under section Malformed TLP on page 47. 
(2) Nexsis Switch will never expect completions, so can just ignore them and let them pass 
thru switch. 

Table 2.5-8 Rx Transaction Layer Errors 

2.5.5.1.5 Tx Data Link Layer Errors 



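Error | S/N | Packet Behavior | Reported Error | Type 
------------------------------------------------------------------------------ 
REPLAY_NUM rolls over | S | Retrain the link | REPLAY_NUM Rollover | C 
REPLAY_TIMER expires | S | Retry entire buffer | Replay Timeout | C 

Table 2.5-9 Tx Data Link Layer Errors 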
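
The two transmit-side behaviors in Table 2.5-9 can be modelled with a small state tracker. This is a sketch only: the class, its method names, and the `rollover_at` threshold are assumptions for illustration (the threshold at which REPLAY_NUM rolls over is left as a parameter rather than asserted), and the counter is reset at the moment of rollover as a simplification of "reset once retraining is successful".

```python
class ReplayTracker:
    """Toy model of the Tx Data Link Layer replay bookkeeping."""

    def __init__(self, rollover_at: int = 4):  # assumed threshold, configurable
        self.rollover_at = rollover_at
        self.replay_num = 0
        self.reported = []  # correctable errors that would be reported

    def on_replay_timer_expired(self) -> str:
        """REPLAY_TIMER expired with no ACK/NAK: report it and replay."""
        self.reported.append("Replay Timeout")
        return self._start_replay()

    def on_nak_received(self) -> str:
        """A NAK DLLP also triggers a replay of the buffer."""
        return self._start_replay()

    def on_ack_received(self) -> None:
        """Forward progress: the replay count starts over."""
        self.replay_num = 0

    def _start_replay(self) -> str:
        self.replay_num += 1
        if self.replay_num >= self.rollover_at:
            self.reported.append("REPLAY_NUM Rollover")
            self.replay_num = 0  # simplification: reset as retraining starts
            return "retrain link"
        return "retry entire buffer"
```

Both reported errors are correctable, so in this model nothing is lost: the replay buffer keeps the TLPs until the link partner acknowledges them.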

2.5.5.1.6 Tx Transaction Layer Errors 




Error | Opt | Packet Behavior | Reported Error | Type 
------------------------------------------------------------------------------ 
Invalid TC, OSD or type | | Nak the transaction coming from the Switch Core. | | 
TLP length exceeds Max Payload Size | | Discard TLP (for TLPs with data payload only) | | 
Actual packet length is greater than pkt length field | | Truncate and nullify | Malformed TLP | 
Actual packet length is less than pkt length field | | Nullify | Malformed TLP | 
Completion Timeout (1) | | | Completion Timeout | NF 

(1) The Nexsis Switch will not issue any Requests requiring Completions, so no timeout should 
ever be enabled. 

Table 2.5-10 Tx Transaction Layer Errors 



2.6 QoS 

There are two basic components to Quality of Service in our switch. First, the Traffic 
Class (TC) field in the transaction allows the driver to differentiate certain transaction 
flows and have them be mapped into Virtual Channels (VCs). So a particular message 
might be labeled with a TC of 7 for high-priority, while standard memory reads and 
writes might be labeled with a TC of 0. Our switch supports all 8 VCs. Second, our 
switch allows different OS Domains to share an I/O port, and each OSD will 
automatically receive its fair share of the bandwidth when that port is congested. 



2.6.1 TC/VC Mapping 

The PCI Express spec lists a 3-bit Traffic Class (TC) field that is present in the 
transaction header. This field is used to differentiate traffic so that it can be prioritized, 
queued, and scheduled independently inside the various PCI Express devices. Although 
the spec says that anything other than TC0 is optional, our switch will accommodate all 8 
traffic classes. 

TCs are mapped onto Virtual Channels (VCs) in a vendor-defined manner according to 
the PCI Express spec. For our switch, each incoming TLP will have its TC field mapped 
into a VC based on software configuration settings. TC0 must always be mapped to 
VC0, but all other TCs may be mapped to VCs without restriction individually on a 
per-port basis. 

Each RX MAC will have a TC mapping table that's just a flat-mapped lookup table that 
turns an OSD and TC into a Source QID. The Source QID is needed to know which of 
the 16 flow control resources were charged with the credits of the incoming TLP. 

Once the destination port and destination QID are returned from the 
address_lookup_module, the transaction can actually be queued up now that all the data 
for the transaction is known. 
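
A flat-mapped lookup table of this kind might look like the sketch below. The sizes come from figures quoted in this document (16 OS Domains, 8 traffic classes, 16 flow control resources), but the index layout (`osd * 8 + tc`), the class name, and its methods are assumptions made purely for illustration.

```python
NUM_OSDS = 16   # OS Domains supported by the MAC
NUM_TCS = 8     # 3-bit Traffic Class field
NUM_QIDS = 16   # flow control resources per port


class TcMapTable:
    """Per-RX-MAC table turning (OSD, TC) into a Source QID."""

    def __init__(self) -> None:
        # One entry per (OSD, TC) pair; everything maps to QID 0 until programmed.
        self._table = [0] * (NUM_OSDS * NUM_TCS)

    @staticmethod
    def _index(osd: int, tc: int) -> int:
        if not (0 <= osd < NUM_OSDS and 0 <= tc < NUM_TCS):
            raise ValueError("OSD or TC out of range")
        return osd * NUM_TCS + tc   # assumed flat layout

    def program(self, osd: int, tc: int, qid: int) -> None:
        """Software configuration write of one table entry."""
        if not 0 <= qid < NUM_QIDS:
            raise ValueError("QID out of range")
        self._table[self._index(osd, tc)] = qid

    def source_qid(self, osd: int, tc: int) -> int:
        """Lookup performed for each incoming TLP."""
        return self._table[self._index(osd, tc)]
```

A single indexed read per TLP is the appeal of the flat layout: the lookup needs no search, only the OSD and TC taken from the incoming packet.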





2.6.2 Arbitration points 

There are three arbitration points inside our switch that will be supported, two in the 
transaction_scheduler and one in the data_mover. In the transaction_scheduler, there are 
17 sets of arbiters that run independently for each output port. Each set of arbiters 
will ensure each input port is allowed its fair share of the output port's bandwidth, and 
each OSD is allowed its fair share as well. These levels of arbitration will be 
discussed in the next few sections. 




2.6.2.1.1 Port arbitration (transaction_scheduler) 

Port arbitration is the first step, run within each port_arbiter at an output port (shown as 
ARB1 in the diagram on the next page). This level of arbitration will use a simple RR 
scheme to make sure every port is served and the bandwidths are balanced. This RR 
scheme is fixed in hardware, no WRR is supported. 



2.6.2.1.2 OSD/VC arbitration (transaction_scheduler) 

Since there are 16 output buffer groups for each set of arbiters, a second level of RR 
arbitration (ARB2) is used to pick which transaction will be selected as the next one to be 
transmitted on a given output port. This RR scheme is fixed in hardware, no WRR is 
supported. 

2.6.2.1.3 Input arbitration (data_mover) 

The data_mover is the final stage (ARB3) that selects which input port is actually 
allowed to transfer data to each output port. Another level of RR is run to make sure 
each input port is serviced when more than one input port has data to send to a given 
output. 
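
All three arbitration levels are described as simple, unweighted round-robin. A minimal software model of one such arbiter is sketched below; the class and method names are hypothetical, and the hardware presumably evaluates the requests in parallel rather than in a loop.

```python
from typing import Optional, Set


class RoundRobinArbiter:
    """Unweighted round-robin over requesters numbered 0..n-1."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.last = n - 1   # so that requester 0 is considered first

    def grant(self, requests: Set[int]) -> Optional[int]:
        """Grant the first requester after the previous winner, if any."""
        for offset in range(1, self.n + 1):
            candidate = (self.last + offset) % self.n
            if candidate in requests:
                self.last = candidate
                return candidate
        return None   # nobody is requesting
```

With two persistent requesters, for instance `{0, 2}` on a four-way arbiter, successive grants alternate 0, 2, 0, 2, which is the fairness property the text relies on.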



Note that the data_mover will skip over the "best" choice from the transaction_scheduler 
if a different input port that is idle is able to move its data instead. In this case, a skip 
flag will be set such that the data_mover will not continue skipping the "best" choice 
more than once. Without this skip flag, one output could get very unlucky and get 
skipped over and over, stalling a transaction potentially indefinitely under certain traffic 
loads. 
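
One possible reading of the skip-flag rule is sketched below. It is an interpretation, not the documented implementation: the function name, its arguments, and the choice to hold the flag until the "best" input is finally served are all assumptions for the example.

```python
from typing import Optional, Tuple


def pick_input(best: int, best_is_idle: bool,
               idle_alternative: Optional[int],
               skip_flag: bool) -> Tuple[Optional[int], bool]:
    """Return (input port to serve or None to wait, updated skip flag).

    'best' is the scheduler's preferred input port; 'idle_alternative' is
    some other input port that could move data right now, if one exists.
    """
    if best_is_idle:
        return best, False               # serve the best choice, clear the flag
    if idle_alternative is not None and not skip_flag:
        return idle_alternative, True    # skip the best choice exactly once
    return None, skip_flag               # already skipped once: wait for best
```

The third branch is what bounds the delay: once the flag is set, no further alternative may overtake the scheduler's choice.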

The following picture shows these levels of arbitration. 



[Figure: transaction_scheduler (output_port_0, with OSD0/VC0 (group 0)) feeding the data_mover] 

Figure 2.6-1: Transaction Scheduler Block Diagram 

2.7 System Level Management Examples 
2.7.2 Hot Plug Add of I/O Device 

Our device will integrate a hot-plug controller, allowing system software and/or chassis 
management software access to several registers to control hot-plug functions. These 
registers are implemented in the PCI-Express capabilities structure and can be used to 
turn on and off LEDs, exchange information with the Power Controller in the chassis, and 
report the status of the various slots in our system. 

Once a link has been trained, OSDs have been negotiated (if necessary), and 
initialization is complete, the hot plug process can begin. 

For hot insertion, this process will start with the user pressing the Attention Button, which 
will set a register bit in our switch for those P2P bridge headers. The COP will send an 
MSI up to the hot plug services software running on each OS on each OSD that shares the 
newly-plugged-in I/O device. The hot plug services software will then begin device 
discovery. 

2.7.2.1.1 Hot-plug messages 

There are 7 different hot-plug messages defined in PCI Express. Every message of this 
type will be routed to the COP for processing (address_lookup_module will match the 
bridge header space) if the message is addressed to one of the P2P bridge headers in our 
switch. These are only required if the LEDs are implemented on the downstream I/O 
device. If the LEDs are driven directly by the switch (using GPIO), these messages won't be 
used. 






2.7.2.1.2 Attention_Button_Pressed 

The Attention Button Pressed register bit must be set to a 1 if our switch receives this 
message from a downstream port. If our switch detects the attention button was pressed 
for a given slot (there are multiple in our switch, one for each port), our switch will generate an 
MSI and send it up to the hot plug services software. 

2.7.2.1.3 Attention_Indicator_On, Attention_Indicator_Blink, 
Attention_Indicator_Off 

These three messages will be generated by the COP and sent downstream depending on 
the state of the attention indicator bit. 

2.7.2.1.4 Power_Indicator_On, Power_Indicator_Blink, Power_Indicator_Off 

These three messages will be generated by the COP and sent downstream depending on 
the state of the power_indicator bit. 

[Figure: Blade 0 through Blade 7 connected over PCI-Ex links; Bus 19] 

Figure 2.7-2: Logical Bus View 
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